Abstract

Modified policy iteration (MPI) is a dynamic 
programming (DP) algorithm that contains 
the two celebrated policy and value itera- 
tion methods. Despite its generality, MPI 
has not been thoroughly studied, especially 
its approximation form which is used when 
the state and/or action spaces are large or 
infinite. In this paper, we propose three im- 
plementations of approximate MPI (AMPI) 
that are extensions of well-known approxi- 
mate DP algorithms: fitted-value iteration, 
fitted-Q iteration, and classification-based 
policy iteration. We provide error propaga- 
tion analyses that unify those for approxi- 
mate policy and value iteration. On the last 
classification-based implementation, we develop a finite-sample analysis
showing that MPI's main parameter controls the balance between the
estimation error of the classifier and the overall value function
approximation.

1. Introduction 

Modified Policy Iteration (MPI) (Puterman & Shin, 1978) is an iterative algorithm to compute the optimal policy and value function of a Markov Decision Process (MDP). Starting from an arbitrary value function $v_0$, it generates a sequence of value-policy pairs

    $\pi_{k+1} = \mathcal{G}\, v_k$    (greedy step)    (1)

    $v_{k+1} = (T_{\pi_{k+1}})^m v_k$    (evaluation step)    (2)

where $\mathcal{G} v_k$ is a greedy policy w.r.t. $v_k$, $T_{\pi_k}$ is the Bellman operator associated with the policy $\pi_k$, and $m \ge 1$ is a parameter. MPI generalizes the well-known dynamic programming algorithms Value Iteration (VI) and Policy Iteration (PI), which correspond to $m = 1$ and $m = \infty$, respectively. MPI requires less computation per iteration than PI (in a way similar to VI), while enjoying the faster convergence of the PI algorithm (Puterman & Shin, 1978). In problems with large state and/or action spaces, approximate versions of VI (AVI) and PI (API) have been the focus of a rich literature (see e.g. Bertsekas & Tsitsiklis 1996; Szepesvari 2010). The aim of this paper is to show that, similarly to its exact form, approximate MPI (AMPI) may represent an interesting alternative to AVI and API algorithms.
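
To make the greedy step (1) and the evaluation step (2) concrete, the following minimal Python sketch runs exact MPI on a small tabular MDP. The two-state MDP, the function names (`lookahead`, `greedy`, `bellman_pi`, `mpi`) and the list-based data layout are assumptions introduced here for illustration only; they are not part of the paper.

```python
# Exact modified policy iteration on a small tabular MDP (illustrative sketch).
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the reward.

def lookahead(P, R, gamma, v, s, a):
    """One-step lookahead value: r(s, a) + gamma * E[v(s')]."""
    return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])

def greedy(P, R, gamma, v):
    """Greedy step (1): in each state pick an action maximizing the lookahead."""
    return [max(range(len(P[s])), key=lambda a: lookahead(P, R, gamma, v, s, a))
            for s in range(len(P))]

def bellman_pi(P, R, gamma, pi, v):
    """Apply the Bellman operator T_pi once to v."""
    return [lookahead(P, R, gamma, v, s, pi[s]) for s in range(len(P))]

def mpi(P, R, gamma, m, iterations):
    """Run MPI from v_0 = 0; m = 1 is value iteration, large m approaches PI."""
    v = [0.0] * len(P)
    pi = greedy(P, R, gamma, v)
    for _ in range(iterations):
        pi = greedy(P, R, gamma, v)           # greedy step (1)
        for _ in range(m):                    # evaluation step (2)
            v = bellman_pi(P, R, gamma, pi, v)
    return pi, v

# Hypothetical two-state MDP: action 0 = stay, action 1 = change.
# The reward is 0 in state 0 and 1 in state 1, whatever the action.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 0.0], [1.0, 1.0]]

pi, v = mpi(P, R, gamma=0.9, m=5, iterations=200)
pi_vi, v_vi = mpi(P, R, gamma=0.9, m=1, iterations=500)
```

On this toy problem both settings of `m` end with the policy "change in state 0, stay in state 1"; only the amount of work per iteration differs.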

In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005); Antos et al. (2007); Munos & Szepesvari (2008) and the classification-based API algorithm of Lagoudakis & Parr (2003); Fern et al. (2006); Lazaric et al. (2010); Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the $L_p$-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API. This is because neither the contraction nor the monotonicity arguments, on which the error propagation analyses of these two algorithms rely, hold for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite-sample analysis in Sec. 5. Our analysis indicates that the parameter $m$ allows us to balance the estimation error of the classifier with the overall quality of the value approximation. We report some preliminary results of applying CBMPI to standard benchmark problems and comparing it with some existing algorithms in (Scherrer et al., 2012, Appendix G).

2. Background 

We consider a discounted MDP $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is a finite action space, $P(ds'|s,a)$, for all $(s,a)$, is a probability kernel on $\mathcal{S}$, the reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is bounded by $R_{\max}$, and $\gamma \in (0,1)$ is a discount factor. A deterministic policy is defined as a mapping $\pi : \mathcal{S} \to \mathcal{A}$. For a policy $\pi$, we may write $r_\pi(s) = r(s, \pi(s))$ and $P_\pi(ds'|s) = P(ds'|s, \pi(s))$. The value of policy $\pi$ in a state $s$ is defined as the expected discounted sum of rewards received starting from state $s$ and following the policy $\pi$, i.e., $v_\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_\pi(s_t) \mid s_0 = s,\ s_{t+1} \sim P_\pi(\cdot|s_t) \right]$. Similarly, the action-value function of a policy $\pi$ at a state-action pair $(s,a)$, $Q_\pi(s,a)$, is the expected discounted sum of rewards received starting from state $s$, taking action $a$, and then following the policy. Since the rewards are bounded by $R_{\max}$, the values and action-values are bounded by $V_{\max} = Q_{\max} = R_{\max}/(1-\gamma)$. The Bellman operator of policy $\pi$ takes a function $f$ on $\mathcal{S}$ as input and returns the function $T_\pi f$ defined as $\forall s,\ [T_\pi f](s) = \mathbb{E}\left[ r_\pi(s) + \gamma f(s') \mid s' \sim P_\pi(\cdot|s) \right]$, or in compact form, $T_\pi f = r_\pi + \gamma P_\pi f$. It is known that $v_\pi$ is the unique fixed-point of $T_\pi$. Given a function $f$ on $\mathcal{S}$, we say that a policy $\pi$ is greedy w.r.t. $f$, and write it as $\pi = \mathcal{G} f$, if $\forall s,\ (T_\pi f)(s) = \max_a (T_a f)(s)$, or equivalently $T_\pi f = \max_{\pi'} (T_{\pi'} f)$. We denote by $v_*$ the optimal value function. It is also known that $v_*$ is the unique fixed-point of the Bellman optimality operator $T : v \to \max_\pi T_\pi v = T_{\mathcal{G}(v)} v$, and that a policy $\pi_*$ that is greedy w.r.t. $v_*$ is optimal and its value satisfies $v_{\pi_*} = v_*$.

3. Approximate MPI Algorithms 

In this section, we describe three approximate MPI (AMPI) algorithms. These algorithms rely on a function space $\mathcal{F}$ to approximate value functions, and in the third algorithm, also on a policy space $\Pi$ to represent greedy policies. In what follows, we describe the iteration $k$ of these iterative algorithms.

3.1. AMPI-V 

For the first and simplest AMPI algorithm presented in the paper, we assume that the values $v_k$ are represented in a function space $\mathcal{F} \subseteq \mathbb{R}^{|\mathcal{S}|}$. In any state $s$, the action $\pi_{k+1}(s)$ that is greedy w.r.t. $v_k$ can be estimated as follows:

    $\pi_{k+1}(s) = \arg\max_{a \in \mathcal{A}} \frac{1}{M} \sum_{j=1}^{M} \left( r_a^{(j)} + \gamma v_k(s_a^{(j)}) \right)$,    (3)

where $\forall a \in \mathcal{A}$ and $1 \le j \le M$, $r_a^{(j)}$ and $s_a^{(j)}$ are samples of rewards and next states when action $a$ is taken in state $s$. Thus, approximating the greedy action in a state $s$ requires $M|\mathcal{A}|$ samples. The algorithm works as follows. It first samples $N$ states from a distribution $\mu$, i.e., $\{s^{(i)}\}_{i=1}^{N} \sim \mu$. From each sampled state $s^{(i)}$, it generates a rollout of size $m$, i.e., $(s^{(i)}, a_0^{(i)}, r_0^{(i)}, s_1^{(i)}, \ldots, a_{m-1}^{(i)}, r_{m-1}^{(i)}, s_m^{(i)})$, where $a_t^{(i)}$ is the action suggested by $\pi_{k+1}$ in state $s_t^{(i)}$, computed using Eq. 3, and $r_t^{(i)}$ and $s_{t+1}^{(i)}$ are the reward and next state induced by this choice of action. For each $s^{(i)}$, we then compute a rollout estimate $\hat{v}_{k+1}(s^{(i)}) = \sum_{t=0}^{m-1} \gamma^t r_t^{(i)} + \gamma^m v_k(s_m^{(i)})$, which is an unbiased estimate of $[(T_{\pi_{k+1}})^m v_k](s^{(i)})$. Finally, $v_{k+1}$ is computed as the best fit in $\mathcal{F}$ to these estimates, i.e.,

    $v_{k+1} = \mathrm{Fit}_{\mathcal{F}}\left( \{ (s^{(i)}, \hat{v}_{k+1}(s^{(i)})) \}_{i=1}^{N} \right)$.

Each iteration of AMPI-V requires $N$ rollouts of size $m$, and in each rollout each of the $|\mathcal{A}|$ actions needs $M$ samples to compute Eq. 3. This gives a total of $Nm(M|\mathcal{A}| + 1)$ transition samples. Note that the fitted value iteration algorithm (Munos & Szepesvari, 2008) is a special case of AMPI-V when $m = 1$.
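
The sampled greedy action of Eq. 3 and the truncated rollout estimate can be sketched in Python as follows. This is only a sketch under stated assumptions: the generative-model interface `sample(s, a, rng)` returning `(reward, next_state)`, the helper names, and the deterministic toy model are all hypothetical and introduced here for illustration.

```python
import random

def sample_greedy_action(sample, actions, v, s, gamma, M, rng):
    """Eq. 3: average M sampled one-step returns per action, then take the argmax."""
    def score(a):
        total = 0.0
        for _ in range(M):
            r, s_next = sample(s, a, rng)
            total += r + gamma * v(s_next)
        return total / M
    return max(actions, key=score)

def rollout_estimate(sample, actions, v, s, gamma, m, M, rng):
    """m-step discounted return along the sampled greedy policy,
    bootstrapped with v at the last state of the rollout."""
    ret, discount = 0.0, 1.0
    for _ in range(m):
        a = sample_greedy_action(sample, actions, v, s, gamma, M, rng)
        r, s = sample(s, a, rng)
        ret += discount * r
        discount *= gamma
    return ret + discount * v(s)

def transitions_per_iteration(N, m, M, num_actions):
    """Sample count stated in the text: N * m * (M * |A| + 1)."""
    return N * m * (M * num_actions + 1)

def toy_sample(s, a, rng):
    """Hypothetical deterministic model: the reward equals the state index;
    action 0 stays, action 1 switches between states 0 and 1."""
    return float(s), (s if a == 0 else 1 - s)

rng = random.Random(0)
v_k = lambda s: float(s)
estimate = rollout_estimate(toy_sample, [0, 1], v_k, 0, gamma=0.5, m=3, M=2, rng=rng)
budget = transitions_per_iteration(N=10, m=3, M=2, num_actions=2)
```

In a full iteration, `rollout_estimate` would be called once per sampled state and the resulting pairs passed to a regressor of one's choice.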

3.2. AMPI-Q 

In AMPI-Q, we replace the value function $v : \mathcal{S} \to \mathbb{R}$ with an action-value function $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The Bellman operator for a policy $\pi$ at a state-action pair $(s, a)$ can then be written as

    $[T_\pi Q](s,a) = \mathbb{E}\left[ r(s,a) + \gamma Q(s', \pi(s')) \mid s' \sim P(\cdot|s,a) \right]$,

and the greedy operator is defined as

    $\pi = \mathcal{G}\, Q \iff \forall s,\ \pi(s) = \arg\max_{a \in \mathcal{A}} Q(s,a)$.

In AMPI-Q, action-value functions $Q_k$ are represented in a function space $\mathcal{F} \subseteq \mathbb{R}^{|\mathcal{S} \times \mathcal{A}|}$, and the greedy action w.r.t. $Q_k$ at a state $s$, i.e., $\pi_{k+1}(s)$, is computed as

    $\pi_{k+1}(s) \in \arg\max_{a \in \mathcal{A}} Q_k(s,a)$.    (4)

The evaluation step is similar to that of AMPI-V, with the difference that now we work with state-action pairs. We sample $N$ state-action pairs from a distribution $\mu$ on $\mathcal{S} \times \mathcal{A}$ and build a rollout set



Approximate Modified Policy Iteration 



Input: Value function space $\mathcal{F}$, policy space $\Pi$, state distribution $\mu$
Initialize: Let $\pi_1 \in \Pi$ be an arbitrary policy and $v_0 \in \mathcal{F}$ an arbitrary value function
for $k = 1, 2, \ldots$ do
    • Perform rollouts:
        Construct the rollout set $\mathcal{D}_k = \{s^{(i)}\}_{i=1}^{n}$, $s^{(i)} \sim \mu$ (i.i.d.)
        for all states $s^{(i)} \in \mathcal{D}_k$ do
            Perform a rollout and return $\hat{v}_k(s^{(i)})$
        end for
        Construct the rollout set $\mathcal{D}'_k = \{s^{(i)}\}_{i=1}^{N}$, $s^{(i)} \sim \mu$ (i.i.d.)
        for all states $s^{(i)} \in \mathcal{D}'_k$ and actions $a \in \mathcal{A}$ do
            for $j = 1$ to $M$ do
                Perform a rollout and return $R_k^j(s^{(i)}, a)$
            end for
        end for
    • Approximate value function:
        $v_k \in \arg\min_{v \in \mathcal{F}} \hat{\mathcal{L}}_k^{\mathcal{F}}(\hat{\mu}; v)$    (regression)
    • Approximate greedy policy:
        $\pi_{k+1} \in \arg\min_{\pi \in \Pi} \hat{\mathcal{L}}_k^{\Pi}(\hat{\mu}; \pi)$    (classification)
end for

Figure 1. The pseudo-code of the CBMPI algorithm.

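$\mathcal{D}_k = \{(s^{(i)}, a^{(i)})\}_{i=1}^{N}$, $(s^{(i)}, a^{(i)}) \sim \mu$. For each $(s^{(i)}, a^{(i)}) \in \mathcal{D}_k$, we generate a rollout of size $m$, i.e., $(s^{(i)}, a^{(i)}, r_0^{(i)}, s_1^{(i)}, a_1^{(i)}, \ldots, s_m^{(i)}, a_m^{(i)})$, where the first action is $a^{(i)}$, $a_t^{(i)}$ for $t \ge 1$ is the action suggested by $\pi_{k+1}$ in state $s_t^{(i)}$ computed using Eq. 4, and $r_t^{(i)}$ and $s_{t+1}^{(i)}$ are the reward and next state induced by this choice of action. For each $(s^{(i)}, a^{(i)}) \in \mathcal{D}_k$, we then compute the rollout estimate

    $\hat{Q}_{k+1}(s^{(i)}, a^{(i)}) = \sum_{t=0}^{m-1} \gamma^t r_t^{(i)} + \gamma^m Q_k(s_m^{(i)}, a_m^{(i)})$,

which is an unbiased estimate of $[(T_{\pi_{k+1}})^m Q_k](s^{(i)}, a^{(i)})$. Finally, $Q_{k+1}$ is the best fit to these estimates in $\mathcal{F}$, i.e.,

    $Q_{k+1} = \mathrm{Fit}_{\mathcal{F}}\left( \{ ((s^{(i)}, a^{(i)}), \hat{Q}_{k+1}(s^{(i)}, a^{(i)})) \}_{i=1}^{N} \right)$.

Each iteration of AMPI-Q requires $Nm$ samples, which is less than that for AMPI-V. However, it uses a hypothesis space on state-action pairs instead of states. Note that the fitted-Q iteration algorithm (Ernst et al., 2005; Antos et al., 2007) is a special case of AMPI-Q when $m = 1$.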
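
The AMPI-Q rollout estimate can be sketched in the same spirit: the first action is imposed, later actions are greedy w.r.t. the current action-value function (Eq. 4), and the return is bootstrapped with that function at the last state-action pair. The `sample(s, a, rng)` interface, the toy model and the toy action-value function below are hypothetical and serve only as an illustration.

```python
import random

def greedy_action(Q, actions, s):
    """Eq. 4: an action maximizing Q(s, .)."""
    return max(actions, key=lambda a: Q(s, a))

def q_rollout_estimate(sample, actions, Q, s, a, gamma, m, rng):
    """sum_{t<m} gamma^t r_t + gamma^m Q(s_m, a_m), with a_0 = a imposed
    and a_t greedy w.r.t. Q for t >= 1."""
    ret, discount = 0.0, 1.0
    for _ in range(m):
        r, s = sample(s, a, rng)
        ret += discount * r
        discount *= gamma
        a = greedy_action(Q, actions, s)
    return ret + discount * Q(s, a)

def toy_sample(s, a, rng):
    """Hypothetical deterministic model: the reward equals the state index;
    action 0 stays, action 1 switches between states 0 and 1."""
    return float(s), (s if a == 0 else 1 - s)

def toy_Q(s, a):
    """Hypothetical current action-value function Q_k."""
    return float(s) + (0.1 if a == 0 else 0.0)

rng = random.Random(0)
estimate = q_rollout_estimate(toy_sample, [0, 1], toy_Q, 0, 1, gamma=0.5, m=2, rng=rng)
```

Unlike the AMPI-V sketch, no extra samples are needed to pick the greedy action, which is where the smaller sample count $Nm$ comes from.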

3.3. Classification-Based MPI 

The third AMPI algorithm presented in this paper, called classification-based MPI (CBMPI), uses an explicit representation for the policies $\pi_k$, in addition to the one used for value functions $v_k$. The idea is similar to the classification-based PI algorithms (Lagoudakis & Parr, 2003; Fern et al., 2006; Lazaric et al., 2010; Gabillon et al., 2011) in which we search for the greedy policy in a policy space $\Pi$ (defined by a classifier) instead of computing it from the estimated value or action-value function (like in AMPI-V and AMPI-Q).

In order to describe CBMPI, we first rewrite the MPI formulation (Eqs. 1 and 2) as

    $v_k = (T_{\pi_k})^m v_{k-1}$    (evaluation step)    (5)

    $\pi_{k+1} = \mathcal{G}\left[ (T_{\pi_k})^m v_{k-1} \right]$    (greedy step)    (6)

Note that in the new formulation both $v_k$ and $\pi_{k+1}$ are functions of $(T_{\pi_k})^m v_{k-1}$. CBMPI is an approximate version of this new formulation. As described in Fig. 1, CBMPI begins with arbitrary initial policy $\pi_1 \in \Pi$ and value function $v_0 \in \mathcal{F}$.^1 At each iteration $k$, a new value function $v_k$ is built as the best approximation of the $m$-step Bellman operator $(T_{\pi_k})^m v_{k-1}$ in $\mathcal{F}$ (evaluation step). This is done by solving a regression problem whose target function is $(T_{\pi_k})^m v_{k-1}$. To set up the regression problem, we build a rollout set $\mathcal{D}_k$ by sampling $n$ states i.i.d. from a distribution $\mu$.^2 For each state $s^{(i)} \in \mathcal{D}_k$, we generate a rollout $(s^{(i)}, a_0^{(i)}, r_0^{(i)}, s_1^{(i)}, \ldots, a_{m-1}^{(i)}, r_{m-1}^{(i)}, s_m^{(i)})$ of size $m$, where $a_t^{(i)} = \pi_k(s_t^{(i)})$, and $r_t^{(i)}$ and $s_{t+1}^{(i)}$ are the reward and next state induced by this choice of action. From this rollout, we compute an unbiased estimate $\hat{v}_k(s^{(i)})$ of $[(T_{\pi_k})^m v_{k-1}](s^{(i)})$ as

    $\hat{v}_k(s^{(i)}) = \sum_{t=0}^{m-1} \gamma^t r_t^{(i)} + \gamma^m v_{k-1}(s_m^{(i)})$,    (7)

and use it to build a training set $\{ (s^{(i)}, \hat{v}_k(s^{(i)})) \}_{i=1}^{n}$. This training set is then used by the regressor to compute $v_k$ as an estimate of $(T_{\pi_k})^m v_{k-1}$.
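
The evaluation step can be sketched as follows. The paper leaves the regressor open; the sketch below assumes, purely for illustration, a tabular function space, for which the least-squares fit is the per-state mean of the rollout targets. The `sample(s, a, rng)` interface, the helper names and the toy model are hypothetical.

```python
import random
from collections import defaultdict

def rollout_return(sample, policy, v_prev, s, gamma, m, rng):
    """Eq. 7: sum_{t<m} gamma^t r_t + gamma^m v_{k-1}(s_m), following `policy`."""
    ret, discount = 0.0, 1.0
    for _ in range(m):
        r, s = sample(s, policy(s), rng)
        ret += discount * r
        discount *= gamma
    return ret + discount * v_prev(s)

def fit_tabular(training_set):
    """Least-squares fit in a tabular function space (an assumption of this
    sketch): the fitted value of a state is the mean of its targets."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, target in training_set:
        sums[s] += target
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

def evaluation_step(sample, policy, v_prev, states, gamma, m, rng):
    """Build the training set {(s_i, v_hat_k(s_i))} and fit v_k to it."""
    training_set = [(s, rollout_return(sample, policy, v_prev, s, gamma, m, rng))
                    for s in states]
    return fit_tabular(training_set)

def toy_sample(s, a, rng):
    """Hypothetical deterministic model: the reward equals the state index;
    action 0 stays, action 1 switches between states 0 and 1."""
    return float(s), (s if a == 0 else 1 - s)

rng = random.Random(0)
v_k = evaluation_step(toy_sample, lambda s: 0, lambda s: 0.0,
                      [0, 1, 1, 0], gamma=0.5, m=2, rng=rng)
```

With a stochastic model the targets of a state would differ across rollouts, and the averaging in `fit_tabular` is what reduces their variance.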

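The greedy step at iteration $k$ computes the policy $\pi_{k+1}$ as the best approximation of $\mathcal{G}\left[ (T_{\pi_k})^m v_{k-1} \right]$ by solving a cost-sensitive classification problem. From the definition of a greedy policy, if $\pi = \mathcal{G}\left[ (T_{\pi_k})^m v_{k-1} \right]$, for each $s \in \mathcal{S}$, we have

    $\left[ T_\pi (T_{\pi_k})^m v_{k-1} \right](s) = \max_{a \in \mathcal{A}} \left[ T_a (T_{\pi_k})^m v_{k-1} \right](s)$.    (8)

By defining $Q_k(s,a) = \left[ T_a (T_{\pi_k})^m v_{k-1} \right](s)$, we may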

^1 Note that the function space $\mathcal{F}$ and policy space $\Pi$ are automatically defined by the choice of the regressor and classifier, respectively.

^2 Here we used the same sampling distribution for both regressor and classifier, but in general different distributions may be used for these two components.
rewrite Eq. 8 as

    $Q_k(s, \pi(s)) = \max_{a \in \mathcal{A}} Q_k(s,a)$.    (9)

The cost-sensitive error function used by CBMPI is of the form

    $\mathcal{L}^{\Pi}_{\pi_k, v_{k-1}}(\mu; \pi) = \int_{\mathcal{S}} \left[ \max_{a \in \mathcal{A}} Q_k(s,a) - Q_k(s, \pi(s)) \right] \mu(ds)$.

To simplify the notation we use $\mathcal{L}^{\Pi}_k$ instead of $\mathcal{L}^{\Pi}_{\pi_k, v_{k-1}}$. To set up this cost-sensitive classification problem, we build a rollout set $\mathcal{D}'_k$ by sampling $N$ states i.i.d. from a distribution $\mu$. For each state $s^{(i)} \in \mathcal{D}'_k$ and each action $a \in \mathcal{A}$, we build $M$ independent rollouts of size $m + 1$, i.e.,^3

    $\left( s^{(i)}, a, r_0^{(i,j)}, s_1^{(i,j)}, a_1^{(i,j)}, \ldots, a_m^{(i,j)}, r_m^{(i,j)}, s_{m+1}^{(i,j)} \right)_{j=1}^{M}$,

where for $t \ge 1$, $a_t^{(i,j)} = \pi_k(s_t^{(i,j)})$, and $r_t^{(i,j)}$ and $s_{t+1}^{(i,j)}$ are the reward and next state induced by this choice of action. From these rollouts, we compute an unbiased estimate of $Q_k(s^{(i)}, a)$ as $\hat{Q}_k(s^{(i)}, a) = \frac{1}{M} \sum_{j=1}^{M} R_k^j(s^{(i)}, a)$ where

    $R_k^j(s^{(i)}, a) = \sum_{t=0}^{m} \gamma^t r_t^{(i,j)} + \gamma^{m+1} v_{k-1}(s_{m+1}^{(i,j)})$.

Given the outcome of the rollouts, CBMPI uses a cost-sensitive classifier to return a policy $\pi_{k+1}$ that minimizes the following empirical error

    $\hat{\mathcal{L}}^{\Pi}_k(\hat{\mu}; \pi) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{a \in \mathcal{A}} \hat{Q}_k(s^{(i)}, a) - \hat{Q}_k(s^{(i)}, \pi(s^{(i)})) \right]$,

with the goal of minimizing the true error $\mathcal{L}^{\Pi}_k(\mu; \pi)$.

Each iteration of CBMPI requires $nm + M|\mathcal{A}|N(m+1)$ (or $M|\mathcal{A}|N(m+1)$ in case we reuse the rollouts, see Footnote 3) transition samples. Note that when $m$ tends to $\infty$, we recover the DPI algorithm proposed and analyzed by Lazaric et al. (2010).
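
The empirical cost-sensitive error and the per-iteration sample count can be sketched as below. The sketch assumes, for illustration only, that the policy space is a small finite set of candidate policies searched exhaustively; a real classifier would replace `classify`. The dictionary layout of the estimates and all names are hypothetical.

```python
def empirical_error(q_hat, states, actions, policy):
    """(1/N) * sum_i [max_a Q_hat(s_i, a) - Q_hat(s_i, policy(s_i))]."""
    total = 0.0
    for s in states:
        best = max(q_hat[(s, a)] for a in actions)
        total += best - q_hat[(s, policy(s))]
    return total / len(states)

def classify(q_hat, states, actions, policy_space):
    """Pick, from a finite candidate set, a policy with the smallest
    empirical error (a stand-in for a cost-sensitive classifier)."""
    return min(policy_space,
               key=lambda pi: empirical_error(q_hat, states, actions, pi))

def cbmpi_transitions(n, N, M, num_actions, m, reuse_rollouts=False):
    """n*m + M*|A|*N*(m+1), or M*|A|*N*(m+1) when the rollouts are reused."""
    critic = 0 if reuse_rollouts else n * m
    return critic + M * num_actions * N * (m + 1)

# Hypothetical rollout-based estimates Q_hat(s, a) on two sampled states.
q_hat = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}
always_0 = lambda s: 0
always_1 = lambda s: 1
chosen = classify(q_hat, [0, 1], [0, 1], [always_0, always_1])
```

Each state contributes a cost equal to the estimated regret of the chosen action, so mistakes on states where all actions are nearly equivalent are penalized very little.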

4. Error propagation 

In this section, we derive a general formulation for the propagation of error through the iterations of an AMPI algorithm. The line of analysis for error propagation is different in VI and PI algorithms. VI analysis is based on the fact that this algorithm computes the fixed point of the Bellman optimality operator, and this operator is a $\gamma$-contraction in max-norm (Bertsekas & Tsitsiklis, 1996; Munos, 2007). On the other hand, it can be shown that the operator by which PI updates the value from one iteration to the next is not a contraction in max-norm in general. Unfortunately, we can show that the same property holds for MPI when it does not reduce to VI (i.e., $m > 1$).

^3 We may implement CBMPI in a more sample-efficient way by reusing the rollouts generated for the greedy step in the evaluation step.

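Proposition 1. If $m > 1$, there exists no norm for which the operator that MPI uses to update the values from one iteration to the next is a contraction.

Proof. Consider a deterministic MDP with two states $\{s_1, s_2\}$, two actions $\{change, stay\}$, rewards $r(s_1) = 0$, $r(s_2) = 1$, and transitions $P_{ch}(s_2|s_1) = P_{ch}(s_1|s_2) = P_{st}(s_1|s_1) = P_{st}(s_2|s_2) = 1$. Consider the following two value functions $v = (\epsilon, 0)$ and $v' = (0, \epsilon)$ with $\epsilon > 0$. Their corresponding greedy policies are $\pi = (st, ch)$ and $\pi' = (ch, st)$, and the next iterates of $v$ and $v'$ can be computed as

    $(T_\pi)^m v = \left( \gamma^m \epsilon,\ 1 + \gamma^m \epsilon \right)$  and  $(T_{\pi'})^m v' = \left( \frac{\gamma - \gamma^m}{1 - \gamma} + \gamma^m \epsilon,\ \frac{1 - \gamma^m}{1 - \gamma} + \gamma^m \epsilon \right)$.

Thus, $(T_{\pi'})^m v' - (T_\pi)^m v = \left( \frac{\gamma - \gamma^m}{1 - \gamma},\ \frac{\gamma - \gamma^m}{1 - \gamma} \right)$, while $v' - v = (-\epsilon, \epsilon)$. Since $\epsilon$ can be arbitrarily small, the norm of $(T_{\pi'})^m v' - (T_\pi)^m v$ can be arbitrarily larger than the norm of $v - v'$ as long as $m > 1$. □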
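
The counter-example in the proof can be checked numerically. The short Python sketch below applies one exact MPI update to $(\epsilon, 0)$ and $(0, \epsilon)$ on the two-state MDP of the proof and compares the gap between the two iterates with $(\gamma - \gamma^m)/(1 - \gamma)$; the encoding of states and actions as integers and the function name `next_iterate` are choices made here for illustration.

```python
def next_iterate(v, gamma, m):
    """One exact MPI update on the two-state MDP of the proof.
    State 0 is s1 (reward 0), state 1 is s2 (reward 1);
    action 0 = stay, action 1 = change."""
    reward = [0.0, 1.0]

    def nxt(s, a):
        return s if a == 0 else 1 - s

    def q(s, a):
        return reward[s] + gamma * v[nxt(s, a)]

    pi = [max((0, 1), key=lambda a: q(s, a)) for s in (0, 1)]   # greedy step
    for _ in range(m):                                          # apply T_pi m times
        v = [reward[s] + gamma * v[nxt(s, pi[s])] for s in (0, 1)]
    return pi, v

gamma, m, eps = 0.9, 3, 1e-6
pi, tv = next_iterate([eps, 0.0], gamma, m)
pi_prime, tv_prime = next_iterate([0.0, eps], gamma, m)
gap = [b - a for a, b in zip(tv, tv_prime)]
expected = (gamma - gamma ** m) / (1 - gamma)

# With m = 1 (value iteration) the same construction gives a vanishing gap.
_, tv1 = next_iterate([eps, 0.0], gamma, 1)
_, tv1_prime = next_iterate([0.0, eps], gamma, 1)
gap_vi = [b - a for a, b in zip(tv1, tv1_prime)]
```

The two inputs differ by `eps` in each coordinate, yet for `m = 3` the outputs differ by a quantity that does not depend on `eps`, which is what rules out a contraction in any norm.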



We also know that the analysis of PI usually relies on the fact that the sequence of the generated values is non-decreasing (Bertsekas & Tsitsiklis, 1996; Munos, 2003). Unfortunately, it can easily be shown that for finite $m$, the value functions generated by MPI may decrease (it suffices to take a very high initial value). It can be seen from what we just described and Proposition 1 that for $m \neq 1$ and $m \neq \infty$, MPI is neither contracting nor non-decreasing, and thus, a new line of proof is needed for the propagation of error in this algorithm.

To study error propagation in AMPI, we introduce an abstract algorithmic model that accounts for potential errors. AMPI starts with an arbitrary value $v_0$ and at each iteration $k \ge 1$ computes the greedy policy w.r.t. $v_{k-1}$ with some error $\epsilon'_k$, called the greedy step error. Thus, we write the new policy $\pi_k$ as

    $\pi_k = \hat{\mathcal{G}}_{\epsilon'_k} v_{k-1}$.    (10)

Eq. 10 means that for any policy $\pi'$,

    $T_{\pi'} v_{k-1} \le T_{\pi_k} v_{k-1} + \epsilon'_k$.

AMPI then generates the new value function $v_k$ with some error $\epsilon_k$, called the evaluation step error

    $v_k = (T_{\pi_k})^m v_{k-1} + \epsilon_k$.    (11)

Before showing how these two errors are propagated through the iterations of AMPI, let us first define them in the context of each of the algorithms presented in Section 3 separately.

AMPI-V: $\epsilon_k$ is the error in fitting the value function $v_k$. This error can be further decomposed into two parts: the one related to the approximation power of $\mathcal{F}$ and the one due to the finite number of samples/rollouts. $\epsilon'_k$ is the error due to using a finite number of samples $M$ for estimating the greedy actions.

AMPI-Q: $\epsilon'_k = 0$ and $\epsilon_k$ is the error in fitting the state-action value function $Q_k$.

CBMPI: This algorithm iterates as follows:

    $v_k = (T_{\pi_k})^m v_{k-1} + \epsilon_k$
    $\pi_{k+1} = \hat{\mathcal{G}}_{\epsilon'_{k+1}} \left[ (T_{\pi_k})^m v_{k-1} \right]$

Unfortunately, this does not exactly match the model described in Eqs. 10 and 11. By introducing the auxiliary variable $w_k = (T_{\pi_k})^m v_{k-1}$, we have $v_k = w_k + \epsilon_k$, and thus, we may write

    $\pi_{k+1} = \hat{\mathcal{G}}_{\epsilon'_{k+1}} [w_k]$.    (12)

Using $v_{k-1} = w_{k-1} + \epsilon_{k-1}$, we have

    $w_k = (T_{\pi_k})^m v_{k-1} = (T_{\pi_k})^m (w_{k-1} + \epsilon_{k-1}) = (T_{\pi_k})^m w_{k-1} + (\gamma P_{\pi_k})^m \epsilon_{k-1}$.    (13)

Now, Eqs. 12 and 13 exactly match Eqs. 10 and 11 by replacing $v_k$ with $w_k$ and $\epsilon_k$ with $(\gamma P_{\pi_k})^m \epsilon_{k-1}$.

The rest of this section is devoted to showing how the errors $\epsilon_k$ and $\epsilon'_k$ propagate through the iterations of an AMPI algorithm. We only outline the main arguments that will lead to the performance bound of Thm. 1 and report most proofs in (Scherrer et al., 2012). We follow the line of analysis developed by Thiery & Scherrer (2010). The results are obtained using the following three quantities:

1) The distance between the optimal value function and the value before approximation at the $k$-th iteration: $d_k = v_* - (T_{\pi_k})^m v_{k-1} = v_* - (v_k - \epsilon_k)$.

2) The shift between the value before approximation and the value of the policy at the $k$-th iteration: $s_k = (T_{\pi_k})^m v_{k-1} - v_{\pi_k} = (v_k - \epsilon_k) - v_{\pi_k}$.

3) The Bellman residual at the $k$-th iteration: $b_k = v_k - T_{\pi_{k+1}} v_k$.

We are interested in finding an upper bound on the loss $l_k = v_* - v_{\pi_k} = d_k + s_k$. To do so, we will upper bound $d_k$ and $s_k$, which requires a bound on the Bellman residual $b_k$. More precisely, the core of our analysis is to prove the following point-wise inequalities for our three quantities of interest.



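Lemma 1 (Proof in (Scherrer et al., 2012, Appendix A)). Let $k \ge 1$, $x_k = (I - \gamma P_{\pi_k}) \epsilon_k + \epsilon'_{k+1}$ and $y_k = -\gamma P_{\pi_*} \epsilon_k + \epsilon'_{k+1}$. We have:

    $b_k \le (\gamma P_{\pi_k})^m b_{k-1} + x_k$,

    $d_{k+1} \le \gamma P_{\pi_*} d_k + y_k + \sum_{j=1}^{m-1} (\gamma P_{\pi_{k+1}})^j b_k$,

    $s_k = (\gamma P_{\pi_k})^m (I - \gamma P_{\pi_k})^{-1} b_{k-1}$.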

Since the stochastic kernels are non-negative, the bounds in Lemma 1 indicate that the loss $l_k$ will be bounded if the errors $\epsilon_k$ and $\epsilon'_k$ are controlled. In fact, if we define $\epsilon$ as a uniform upper-bound on the errors $|\epsilon_k|$ and $|\epsilon'_k|$, the first inequality in Lemma 1 implies that $b_k \le O(\epsilon)$, and as a result, the second and third inequalities give us $d_k \le O(\epsilon)$ and $s_k \le O(\epsilon)$. This means that the loss will also satisfy $l_k \le O(\epsilon)$.

Our bound for the loss $l_k$ is the result of careful expansion and combination of the three inequalities in Lemma 1. Before we state this result, we introduce some notation that will ease our formulation.

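Definition 1. For a positive integer $n$, we define $\mathbb{P}_n$ as the set of transition kernels that are defined as follows:

1) for any set of $n$ policies $\{\pi_1, \ldots, \pi_n\}$, $(\gamma P_{\pi_1})(\gamma P_{\pi_2}) \cdots (\gamma P_{\pi_n}) \in \mathbb{P}_n$,

2) for any $\alpha \in (0,1)$ and $(P_1, P_2) \in \mathbb{P}_n \times \mathbb{P}_n$, $\alpha P_1 + (1 - \alpha) P_2 \in \mathbb{P}_n$.

Furthermore, we use the somewhat abusive notation $\Gamma^n$ for denoting any element of $\mathbb{P}_n$. For example, if we write a transition kernel $P$ as $P = \alpha_1 \Gamma^i + \alpha_2 \Gamma^j \Gamma^k = \alpha_1 \Gamma^i + \alpha_2 \Gamma^{j+k}$, it should be read as there exist $P_1 \in \mathbb{P}_i$, $P_2 \in \mathbb{P}_j$, $P_3 \in \mathbb{P}_k$, and $P_4 \in \mathbb{P}_{k+j}$ such that $P = \alpha_1 P_1 + \alpha_2 P_2 P_3 = \alpha_1 P_1 + \alpha_2 P_4$.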

Using the notation introduced in Definition 1, we now derive a point-wise bound on the loss.

Lemma 2 (Proof in (Scherrer et al., 2012, Appendix B)). After $k$ iterations, the losses of AMPI-V and AMPI-Q satisfy

    $l_k \le 2 \sum_{i=1}^{k-1} \sum_{j=i}^{\infty} \Gamma^j |\epsilon_{k-i}| + \sum_{i=0}^{k-1} \sum_{j=i}^{\infty} \Gamma^j |\epsilon'_{k-i}| + h(k)$,

while the loss of CBMPI satisfies

    $l_k \le 2 \sum_{i=1}^{k-2} \sum_{j=i+m}^{\infty} \Gamma^j |\epsilon_{k-i-1}| + \sum_{i=0}^{k-1} \sum_{j=i}^{\infty} \Gamma^j |\epsilon'_{k-i}| + h(k)$,

where $h(k) = 2 \sum_{j=k}^{\infty} \Gamma^j |d_0|$ or $h(k) = 2 \sum_{j=k}^{\infty} \Gamma^j |b_0|$.

Remark 1. A close look at the existing point-wise error bounds for AVI (Munos, 2007, Lemma 4.1) and API (Munos, 2003, Corollary 10) shows that they do not consider error in the greedy step (i.e., $\epsilon'_k = 0$) and that they have the following form:

    $\limsup_{k \to \infty} l_k \le 2 \limsup_{k \to \infty} \sum_{i=1}^{k-1} \sum_{j=i}^{\infty} \Gamma^j |\epsilon_{k-i}|$.

This indicates that the bound in Lemma 2 not only unifies the analysis of AVI and API, but also generalizes them to the case of error in the greedy step and to a finite horizon $k$. Moreover, our bound suggests that the way the errors are propagated in the whole family of algorithms VI/PI/MPI does not depend on $m$ at the level of the abstraction suggested by Definition 1.^4

The next step is to show how the point-wise bound of Lemma 2 can be turned into a bound in weighted $L_p$-norm, which for any function $f : \mathcal{S} \to \mathbb{R}$ and any distribution $\mu$ on $\mathcal{S}$ is defined as $\|f\|_{p,\mu} = \left[ \int |f(x)|^p \mu(dx) \right]^{1/p}$. Munos (2003; 2007); Munos & Szepesvari (2008), and the recent work of Farahmand et al. (2010), which provides the most refined bounds for API and AVI, show how to do this through quantities, called concentrability coefficients, that measure how a distribution over states may concentrate through the dynamics of the MDP. We now state a lemma that generalizes the analysis of Farahmand et al. (2010) to a larger class of concentrability coefficients. We will discuss the potential advantage of this new class in Remark 4. We will also show through the proofs of Thms. 1 and 3, how the result of Lemma 3 provides us with a flexible tool for turning point-wise bounds into $L_p$-norm bounds. Thm. 3 in (Scherrer et al., 2012, Appendix D) provides an alternative bound for the loss of AMPI, which in analogy with the results of Farahmand et al. (2010) shows that the last iterations have the highest impact on the loss (the influence exponentially decreases towards the initial iterations).
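
On a finite state space the weighted $L_p$-norm reduces to a weighted sum, which the following short Python sketch computes. The function name and the list representation of $f$ and $\mu$ are assumptions made here for illustration.

```python
def weighted_lp_norm(f, mu, p):
    """||f||_{p,mu} = (sum_x |f(x)|^p * mu(x))^(1/p) on a finite state space.
    `f` and `mu` are lists indexed by state; `mu` is assumed to sum to 1."""
    return sum(abs(fx) ** p * w for fx, w in zip(f, mu)) ** (1.0 / p)

f = [3.0, -4.0]
mu = [0.5, 0.5]
norm_1 = weighted_lp_norm(f, mu, 1)   # 0.5 * 3 + 0.5 * 4
norm_2 = weighted_lp_norm(f, mu, 2)   # sqrt(0.5 * 9 + 0.5 * 16)
```

Because $\mu$ is a probability distribution, the norm is non-decreasing in $p$ (a consequence of Jensen's inequality), so here `norm_1 <= norm_2`.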
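
Lemma 3 (Proof in (Scherrer et al., 2012, Appendix C)). Let $\mathcal{I}$ and $(\mathcal{J}_i)_{i \in \mathcal{I}}$ be sets of positive integers, $\{\mathcal{I}_1, \ldots, \mathcal{I}_n\}$ be a partition of $\mathcal{I}$, and $f$ and $(g_i)_{i \in \mathcal{I}}$ be functions satisfying

    $|f| \le \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}_i} \Gamma^j |g_i| = \sum_{l=1}^{n} \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \Gamma^j |g_i|$.

Then for all $p$, $q$ and $q'$ such that $\frac{1}{q} + \frac{1}{q'} = 1$, and for all distributions $\rho$ and $\mu$, we have

    $\|f\|_{p,\rho} \le \sum_{l=1}^{n} \left( \mathcal{C}_q(l) \right)^{1/p} \sup_{i \in \mathcal{I}_l} \|g_i\|_{pq',\mu} \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j$,

with the following concentrability coefficients

    $\mathcal{C}_q(l) = \frac{\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j c_q(j)}{\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j}$,

with the Radon-Nikodym derivative based quantity

    $c_q(j) = \max_{\pi_1, \ldots, \pi_j} \left\| \frac{d(\rho P_{\pi_1} P_{\pi_2} \cdots P_{\pi_j})}{d\mu} \right\|_{q,\mu}$.    (14)

^4 Note however that the dependence on $m$ will reappear if we make explicit what is hidden in the terms $\Gamma^j$.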

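We now derive an $L_p$-norm bound for the loss of the AMPI algorithm by applying Lemma 3 to the point-wise bound of Lemma 2.

Theorem 1 (Proof in (Scherrer et al., 2012, Appendix D)). Let $\rho$ and $\mu$ be distributions over states. Let $p$, $q$, and $q'$ be such that $\frac{1}{q} + \frac{1}{q'} = 1$. After $k$ iterations, the loss of AMPI satisfies

    $\|l_k\|_{p,\rho} \le \frac{2(\gamma - \gamma^k) \left( \mathcal{C}_q^{1,k,0} \right)^{1/p}}{(1-\gamma)^2} \sup_{1 \le j \le k-1} \|\epsilon_j\|_{pq',\mu} + \frac{(1 - \gamma^k) \left( \mathcal{C}_q^{0,k,0} \right)^{1/p}}{(1-\gamma)^2} \sup_{1 \le j \le k} \|\epsilon'_j\|_{pq',\mu} + g(k)$,    (15)

while the loss of CBMPI satisfies

    $\|l_k\|_{p,\rho} \le \frac{2\gamma^m (\gamma - \gamma^{k-1}) \left( \mathcal{C}_q^{2,k,m} \right)^{1/p}}{(1-\gamma)^2} \sup_{1 \le j \le k-2} \|\epsilon_j\|_{pq',\mu} + \frac{(1 - \gamma^k) \left( \mathcal{C}_q^{1,k,0} \right)^{1/p}}{(1-\gamma)^2} \sup_{1 \le j \le k} \|\epsilon'_j\|_{pq',\mu} + g(k)$,    (16)

where for all $q$, $l$, $k$ and $d$, the concentrability coefficients $\mathcal{C}_q^{l,k,d}$ are defined as

    $\mathcal{C}_q^{l,k,d} = \frac{(1-\gamma)^2}{\gamma^l - \gamma^k} \sum_{i=l}^{k-1} \sum_{j=i}^{\infty} \gamma^j c_q(j + d)$,

with $c_q(j)$ given by Eq. 14, and $g(k)$ is defined as

    $g(k) = \frac{2\gamma^k}{1-\gamma} \left( \mathcal{C}_q^{k,k+1,0} \right)^{1/p} \min\left( \|d_0\|_{pq',\mu},\ \|b_0\|_{pq',\mu} \right)$.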



Remark 2. When $p$ tends to infinity, the first bound of Thm. 1 reduces to

    $\|l_k\|_\infty \le \frac{2(\gamma - \gamma^k)}{(1-\gamma)^2} \sup_{1 \le j \le k-1} \|\epsilon_j\|_\infty + \frac{1 - \gamma^k}{(1-\gamma)^2} \sup_{1 \le j \le k} \|\epsilon'_j\|_\infty + \frac{2\gamma^k}{1-\gamma} \min\left( \|d_0\|_\infty,\ \|b_0\|_\infty \right)$.    (17)

When $k$ goes to infinity, Eq. 17 gives us a generalization of the API ($m = \infty$) bound of Bertsekas & Tsitsiklis (1996, Prop. 6.2), i.e.,

    $\limsup_{k \to \infty} \|l_k\|_\infty \le \frac{2\gamma \sup_j \|\epsilon_j\|_\infty + \sup_j \|\epsilon'_j\|_\infty}{(1-\gamma)^2}$.

Moreover, since our point-wise analysis generalizes those of API and AVI (as noted in Remark 1), the $L_p$-bound of Eq. 15 unifies and generalizes those for API (Munos, 2003) and AVI (Munos, 2007).
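
As a sanity check of the max-norm bound, the Python sketch below runs MPI with an exact greedy step and a bounded random evaluation error on a toy MDP, then compares the loss of the last policy with the right-hand side of the bound, read here as $\frac{2(\gamma-\gamma^k)}{(1-\gamma)^2}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|b_0\|_\infty$ when the greedy step is exact. The two-state MDP, the uniform noise model and every function name are assumptions made for illustration; the experiment is not from the paper.

```python
import random

def lookahead(P, R, gamma, v, s, a):
    return R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])

def greedy(P, R, gamma, v):
    return [max(range(len(P[s])), key=lambda a: lookahead(P, R, gamma, v, s, a))
            for s in range(len(P))]

def apply_T_pi(P, R, gamma, pi, v, times):
    for _ in range(times):
        v = [lookahead(P, R, gamma, v, s, pi[s]) for s in range(len(P))]
    return v

def loss_after_k(P, R, gamma, m, k, err, rng):
    """Max-norm loss v_* - v_{pi_k} of MPI run from v_0 = 0 with evaluation
    errors drawn uniformly in [-err, err] and an exact greedy step."""
    n = len(P)
    v_star = [0.0] * n
    for _ in range(2000):                       # value iteration for v_*
        v_star = [max(lookahead(P, R, gamma, v_star, s, a)
                      for a in range(len(P[s]))) for s in range(n)]
    v = [0.0] * n
    pi = greedy(P, R, gamma, v)
    for _ in range(k):
        pi = greedy(P, R, gamma, v)
        v = [x + rng.uniform(-err, err)
             for x in apply_T_pi(P, R, gamma, pi, v, m)]
    v_pi = apply_T_pi(P, R, gamma, pi, [0.0] * n, 2000)   # value of pi_k
    return max(a - b for a, b in zip(v_star, v_pi))

def sup_norm_bound(gamma, k, err, b0_norm):
    """Right-hand side of the max-norm bound with no greedy-step error."""
    return (2 * (gamma - gamma ** k) / (1 - gamma) ** 2 * err
            + 2 * gamma ** k / (1 - gamma) * b0_norm)

# Hypothetical two-state MDP: action 0 = stay, action 1 = change.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 0.0], [1.0, 1.0]]

loss = loss_after_k(P, R, 0.9, m=3, k=40, err=0.5, rng=random.Random(0))
# With v_0 = 0, b_0 = -r_{pi_1}, so its max-norm is at most R_max = 1.
bound = sup_norm_bound(0.9, k=40, err=0.5, b0_norm=1.0)
```

The bound is a worst-case guarantee, so on such a small problem the observed loss is expected to sit far below it.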

Remark 3. Canbolat & Rothblum (2012) recently (and independently) developed an analysis of an approximate form of MPI. Also, as mentioned, the proof technique that we used is based on that of Thiery & Scherrer (2010). While Canbolat & Rothblum (2012) only consider the error in the greedy step and Thiery & Scherrer (2010) only that in the value update, our work is more general in that we consider both sources of error; this is required for the analysis of CBMPI. Thiery & Scherrer (2010) and Canbolat & Rothblum (2012) provide bounds when the errors are controlled in max-norm, while we consider the more general $L_p$-norm. At a more technical level, Th. 2 in (Canbolat & Rothblum, 2012) bounds the norm of the distance $v_* - v_k$, while we bound the loss $v_* - v_{\pi_k}$. If we derive a bound on the loss from it (using e.g., Th. 1 in (Canbolat & Rothblum, 2012)), this leads to a bound that is looser than ours. In particular, it does not allow us to recover the standard bounds for AVI/API, as we managed to do (cf. Remark 2).

Remark 4. We can balance the influence of the concentrability coefficients (the bigger the $q$, the higher the influence) and the difficulty of controlling the errors (the bigger the $q'$, the greater the difficulty in controlling the $L_{pq'}$-norms) by tuning the parameters $q$ and $q'$, given the condition that $\frac{1}{q} + \frac{1}{q'} = 1$. This potential leverage is an improvement over the existing bounds and concentrability results that only consider specific values of these two parameters: $q = \infty$ and $q' = 1$ in Munos (2007); Munos & Szepesvari (2008), and $q = q' = 2$ in Farahmand et al. (2010).

Remark 5. For CBMPI, the parameter $m$ controls the influence of the value function approximator, cancelling it out in the limit when $m$ tends to infinity (see Eq. 16). Assuming a fixed budget of sample transitions, increasing $m$ reduces the number of rollouts used by the classifier and thus worsens its quality; in such a situation, $m$ allows us to trade off the estimation error of the classifier against the overall value function approximation.

5. Finite-Sample Analysis of CBMPI 

In this section, we focus on CBMPI and detail the possible form of the error terms that appear in the bound of Thm. 1. We select CBMPI among the proposed algorithms because its analysis is more general than the others, as we need to bound both greedy and evaluation step errors (in some norm), and also because it displays an interesting influence of the parameter $m$ (see Remark 5). We first provide a bound on the greedy step error. From the definition of $\epsilon'_k$ for CBMPI (Eq. 12) and the description of the greedy step in CBMPI, we can easily observe that $\|\epsilon'_k\|_{1,\mu} = \mathcal{L}^{\Pi}_{k-1}(\mu; \pi_k)$.



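Lemma 4 (Proof in (Scherrer et al., 2012, Appendix E)). Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi)$ and $\mu$ be a distribution over the state space $\mathcal{S}$. Let $N$ be the number of states in $\mathcal{D}'_{k-1}$ drawn i.i.d. from $\mu$, $M$ be the number of rollouts per state-action pair used in the estimation of $\hat{Q}_{k-1}$, and $\pi_k = \arg\min_{\pi \in \Pi} \hat{\mathcal{L}}^{\Pi}_{k-1}(\hat{\mu}, \pi)$ be the policy computed at iteration $k-1$ of CBMPI. Then, for any $\delta > 0$, we have

    $\|\epsilon'_k\|_{1,\mu} = \mathcal{L}^{\Pi}_{k-1}(\mu; \pi_k) \le \inf_{\pi \in \Pi} \mathcal{L}^{\Pi}_{k-1}(\mu; \pi) + 2(\epsilon'_1 + \epsilon'_2)$,

with probability at least $1 - \delta$, where

    $\epsilon'_1(N, \delta) = 16\, Q_{\max} \sqrt{ \frac{2}{N} \left( h \log \frac{eN}{h} + \log \frac{32}{\delta} \right) }$,

    $\epsilon'_2(N, M, \delta) = 8\, Q_{\max} \sqrt{ \frac{2}{MN} \left( h \log \frac{eMN}{h} + \log \frac{32}{\delta} \right) }$.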

We now consider the evaluation step error. The evaluation step at iteration $k$ of CBMPI is a regression problem with the target $(T_{\pi_k})^m v_{k-1}$ and a training set $\{ (s^{(i)}, \hat{v}_k(s^{(i)})) \}_{i=1}^{n}$ in which the states $s^{(i)}$ are i.i.d. samples from $\mu$ and $\hat{v}_k(s^{(i)})$ are unbiased estimates of the target computed according to Eq. 7. Different function spaces $\mathcal{F}$ (linear or non-linear) may be used to approximate $(T_{\pi_k})^m v_{k-1}$. Here we consider a linear architecture with parameters $\alpha \in \mathbb{R}^d$ and bounded (by $L$) basis functions $\{\varphi_j\}_{j=1}^{d}$, $\|\varphi_j\|_\infty \le L$. We denote by $\phi : \mathcal{X} \to \mathbb{R}^d$, $\phi(\cdot) = (\varphi_1(\cdot), \ldots, \varphi_d(\cdot))^\top$ the feature vector, and by $\mathcal{F}$ the linear function space spanned by the features $\varphi_j$, i.e., $\mathcal{F} = \{ f_\alpha(\cdot) = \phi(\cdot)^\top \alpha : \alpha \in \mathbb{R}^d \}$. Now if we define $v_k$ as the truncation (by $V_{\max}$) of the solution of the above linear regression problem, we may bound the evaluation step error using the following lemma.

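Lemma 5 (Proof in (Scherrer et al., 2012, Appendix F)). Consider the linear regression setting described above, then we have

    $\|\epsilon_k\|_{2,\mu} \le 4 \inf_{f \in \mathcal{F}} \|(T_{\pi_k})^m v_{k-1} - f\|_{2,\mu} + \epsilon_1 + \epsilon_2$,

with probability at least $1 - \delta$, where

    $\epsilon_1(n, \delta) = 32\, V_{\max} \sqrt{ \frac{2}{n} \log\left( \frac{27 (12 e^2 n)^{2(d+1)}}{\delta} \right) }$,

    $\epsilon_2(n, \delta) = 24 \left( V_{\max} + \|\alpha_*\|_2 \cdot \sup_x \|\phi(x)\|_2 \right) \sqrt{ \frac{2}{n} \log \frac{9}{\delta} }$,

and $\alpha_*$ is such that $f_{\alpha_*}$ is the best approximation (w.r.t. $\mu$) of the target function $(T_{\pi_k})^m v_{k-1}$ in $\mathcal{F}$.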

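From Lemmas 4 and 5, we have bounds on $\|\epsilon'_k\|_{1,\mu}$ and $\|\epsilon_k\|_{1,\mu} \le \|\epsilon_k\|_{2,\mu}$. By a union bound argument, we thus control the r.h.s. of Eq. 16 in $L_1$-norm. In the context of Th. 1, this means $p = 1$, $q' = 1$ and $q = \infty$, and we have the following bound for CBMPI:

Theorem 2. Let $d' = \sup_{g \in \mathcal{F}, \pi' \in \Pi} \inf_{\pi \in \Pi} \mathcal{L}^{\Pi}_{\pi', g}(\mu; \pi)$ and $d_m = \sup_{g \in \mathcal{F}, \pi \in \Pi} \inf_{f \in \mathcal{F}} \|(T_\pi)^m g - f\|_{2,\mu}$. With the notations of Th. 1 and Lemmas 4-5, after $k$ iterations, and with probability $1 - \delta$, the expected loss $\mathbb{E}_\mu[l_k] = \|l_k\|_{1,\mu}$ of CBMPI is bounded by

    $\frac{2\gamma^m (\gamma - \gamma^{k-1})\, \mathcal{C}_\infty^{2,k,m}}{(1-\gamma)^2} \left( d_m + \epsilon_1(n, \tfrac{\delta}{2k}) + \epsilon_2(n, \tfrac{\delta}{2k}) \right) + \frac{(1 - \gamma^k)\, \mathcal{C}_\infty^{1,k,0}}{(1-\gamma)^2} \left( d' + \epsilon'_1(N, \tfrac{\delta}{2k}) + \epsilon'_2(N, M, \tfrac{\delta}{2k}) \right) + g(k)$.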

Remark 6. This result leads to a quantitative version of Remark 5. Assume that we have a fixed budget for the actor and the critic $B = nm = NM|\mathcal{A}|m$. Then, up to constants and logarithmic factors, the bound has the form

    $O\left( \gamma^m \left( d_m + \sqrt{\frac{m}{B}} \right) + d' + \sqrt{\frac{M|\mathcal{A}|m}{B}} \right)$.

It shows a trade-off in the tuning of $m$: a big $m$ can make the influence of the overall (approximation and estimation) value error small, but that of the estimation error of the classifier bigger.
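
The trade-off can be visualized by evaluating the shape of the bound for different values of $m$ under a fixed budget. The Python sketch below assumes the shape $\gamma^m (d_m + \sqrt{m/B}) + d' + \sqrt{M|\mathcal{A}|m/B}$, ignores constants and logarithmic factors, and treats $d_m$ and $d'$ as constants independent of $m$; all numerical values and function names are hypothetical.

```python
import math

def bound_shape(m, B, gamma, d_m, d_prime, M, num_actions):
    """Shape of the bound up to constants and log factors (an assumption of
    this sketch): a critic term shrinking with m plus an actor term growing with m."""
    critic = gamma ** m * (d_m + math.sqrt(m / B))
    actor = d_prime + math.sqrt(M * num_actions * m / B)
    return critic + actor

def best_m(candidates, **params):
    """Value of m minimizing the bound shape among the candidates."""
    return min(candidates, key=lambda m: bound_shape(m, **params))

params = dict(B=1e6, gamma=0.9, d_m=1.0, d_prime=0.0, M=1, num_actions=2)
m_star = best_m(range(1, 201), **params)
```

With these made-up numbers the minimizer lies strictly between the extremes `m = 1` and `m = 200`: a small $m$ leaves the value approximation error almost undiscounted, while a large $m$ starves the classifier of rollouts.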

6. Summary and Extensions 

In this paper, we studied a DP algorithm, called modified policy iteration (MPI), that, despite its generality (it contains the celebrated policy and value iteration methods), has not been thoroughly investigated in the literature. We proposed three approximate MPI (AMPI) algorithms that are extensions of the well-known ADP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We reported an error propagation analysis for AMPI that unifies those for approximate policy and value iteration. We also provided a finite-sample analysis for the classification-based implementation of AMPI (CBMPI), whose analysis is more general than that of the other presented AMPI methods. Our results indicate that the parameter of MPI allows us to control the balance of errors (in value function approximation and estimation of the greedy policy) in the final performance of CBMPI. Although AMPI generalizes the existing AVI and classification-based API algorithms, additional experimental work and careful theoretical analysis are required to obtain a better understanding of the behaviour of its different implementations and their relation to competing methods. Extension of CBMPI to problems with continuous action space is another interesting direction to pursue.

Supplementary Material for
Approximate Modified Policy Iteration

A. Proof of Lemma 1 

Before we start, we recall the following definitions: 

bk = Wfc -T^TTfe+iVfe, dfc = - (T^J"i;A;_i - {vk~ek), Sk = (r^J"ufe-i - w^^^ = {vk-ek)-v^^. 

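Bounding $b_k$

$$\begin{aligned}
b_k &= v_k - T_{\pi_{k+1}} v_k = v_k - T_{\pi_k} v_k + T_{\pi_k} v_k - T_{\pi_{k+1}} v_k \overset{(a)}{\le} v_k - T_{\pi_k} v_k + \epsilon'_{k+1} \\
&= v_k - \epsilon_k - T_{\pi_k} v_k + \gamma P_{\pi_k}\epsilon_k + \epsilon_k - \gamma P_{\pi_k}\epsilon_k + \epsilon'_{k+1} \\
&\overset{(b)}{=} v_k - \epsilon_k - T_{\pi_k}(v_k - \epsilon_k) + (I - \gamma P_{\pi_k})\epsilon_k + \epsilon'_{k+1}.
\end{aligned} \tag{18}$$

Using the definition of $x_k$, i.e.,

$$x_k = (I - \gamma P_{\pi_k})\epsilon_k + \epsilon'_{k+1}, \tag{19}$$

we may write Eq. (18) as

$$\begin{aligned}
b_k &\le v_k - \epsilon_k - T_{\pi_k}(v_k - \epsilon_k) + x_k \overset{(c)}{=} (T_{\pi_k})^m v_{k-1} - T_{\pi_k}(T_{\pi_k})^m v_{k-1} + x_k = (T_{\pi_k})^m v_{k-1} - (T_{\pi_k})^m (T_{\pi_k} v_{k-1}) + x_k \\
&= (\gamma P_{\pi_k})^m (v_{k-1} - T_{\pi_k} v_{k-1}) + x_k = (\gamma P_{\pi_k})^m b_{k-1} + x_k.
\end{aligned} \tag{20}$$

(a) From the definition of $\epsilon'_{k+1}$, we have $\forall \pi',\ T_{\pi'} v_k \le T_{\pi_{k+1}} v_k + \epsilon'_{k+1}$; thus this inequality also holds for $\pi' = \pi_k$.

(b) This step is due to the fact that for every $v$ and $v'$, we have $T_{\pi_k}(v + v') = T_{\pi_k} v + \gamma P_{\pi_k} v'$.

(c) This is from the definition of $\epsilon_k$, i.e., $v_k = (T_{\pi_k})^m v_{k-1} + \epsilon_k$.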
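The recursion in Eq. (20) can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a small random MDP (the arrays `P` and `r` and all function names are hypothetical), an exact greedy step (so $\epsilon'_{k+1} = 0$), and Gaussian evaluation errors $\epsilon_k$. It tracks the largest component of $b_k - \big[(\gamma P_{\pi_k})^m b_{k-1} + x_k\big]$, which should never be positive up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, m = 6, 3, 0.9, 4

# Hypothetical random MDP: P[a, s, s'] are transition probabilities, r[s, a] rewards.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))


def greedy(v):
    """Exact greedy policy w.r.t. v, so the greedy-step error is zero."""
    q = r + gamma * np.einsum("ast,t->sa", P, v)
    return q.argmax(axis=1)


def policy_model(pi):
    """Return (P_pi, r_pi) for a deterministic policy pi."""
    idx = np.arange(n_states)
    return P[pi, idx, :], r[idx, pi]


def bellman(pi, v):
    """Apply the Bellman operator T_pi to v."""
    P_pi, r_pi = policy_model(pi)
    return r_pi + gamma * P_pi @ v


v = rng.normal(size=n_states)       # v_0
pi_next = greedy(v)                 # pi_1
b_prev = v - bellman(pi_next, v)    # b_0 = v_0 - T_{pi_1} v_0
max_violation = -np.inf

for k in range(1, 20):
    pi = pi_next                                # pi_k
    w = v.copy()
    for _ in range(m):
        w = bellman(pi, w)                      # (T_{pi_k})^m v_{k-1}
    eps = 0.1 * rng.normal(size=n_states)       # evaluation error eps_k
    v = w + eps                                 # v_k
    pi_next = greedy(v)                         # pi_{k+1}
    b = v - bellman(pi_next, v)                 # b_k
    P_pi, _ = policy_model(pi)
    x = eps - gamma * P_pi @ eps                # x_k with eps'_{k+1} = 0
    bound = np.linalg.matrix_power(gamma * P_pi, m) @ b_prev + x
    max_violation = max(max_violation, float(np.max(b - bound)))
    b_prev = b

print("largest violation of Eq. (20):", max_violation)
```

With an exact greedy step the only slack in Eq. (20) comes from step (a), so the printed value is at most zero up to floating-point error.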

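Bounding $d_k$

$$\begin{aligned}
d_{k+1} &= v_* - (T_{\pi_{k+1}})^m v_k = T_{\pi_*} v_* - T_{\pi_*} v_k + T_{\pi_*} v_k - T_{\pi_{k+1}} v_k + T_{\pi_{k+1}} v_k - (T_{\pi_{k+1}})^m v_k \\
&\overset{(a)}{\le} \gamma P_{\pi_*}(v_* - v_k) + \epsilon'_{k+1} + g_{k+1} = \gamma P_{\pi_*}(v_* - v_k) + \gamma P_{\pi_*}\epsilon_k - \gamma P_{\pi_*}\epsilon_k + \epsilon'_{k+1} + g_{k+1} \\
&\overset{(b)}{=} \gamma P_{\pi_*}\big(v_* - (v_k - \epsilon_k)\big) + y_k + g_{k+1} = \gamma P_{\pi_*} d_k + y_k + g_{k+1} \overset{(c)}{=} \gamma P_{\pi_*} d_k + y_k + \sum_{j=1}^{m-1} (\gamma P_{\pi_{k+1}})^j b_k.
\end{aligned} \tag{21}$$

(a) This step is from the definition of $\epsilon'_{k+1}$ (see step (a) in bounding $b_k$) and by defining $g_{k+1}$ as follows:

$$g_{k+1} = T_{\pi_{k+1}} v_k - (T_{\pi_{k+1}})^m v_k. \tag{22}$$

(b) This is from the definition of $y_k$, i.e.,

$$y_k = -\gamma P_{\pi_*}\epsilon_k + \epsilon'_{k+1}. \tag{23}$$

(c) This step comes from rewriting $g_{k+1}$ as

$$\begin{aligned}
g_{k+1} &= T_{\pi_{k+1}} v_k - (T_{\pi_{k+1}})^m v_k = \sum_{j=1}^{m-1} \big[(T_{\pi_{k+1}})^j v_k - (T_{\pi_{k+1}})^{j+1} v_k\big] = \sum_{j=1}^{m-1} \big[(T_{\pi_{k+1}})^j v_k - (T_{\pi_{k+1}})^j (T_{\pi_{k+1}} v_k)\big] \\
&= \sum_{j=1}^{m-1} (\gamma P_{\pi_{k+1}})^j (v_k - T_{\pi_{k+1}} v_k) = \sum_{j=1}^{m-1} (\gamma P_{\pi_{k+1}})^j b_k.
\end{aligned} \tag{24}$$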
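Bounding $s_k$

With some slight abuse of notation, we have $v_{\pi_k} = (T_{\pi_k})^\infty v_k$, and thus:

$$\begin{aligned}
s_k &= (T_{\pi_k})^m v_{k-1} - v_{\pi_k} \overset{(a)}{=} (T_{\pi_k})^m v_{k-1} - (T_{\pi_k})^\infty v_{k-1} = (T_{\pi_k})^m v_{k-1} - (T_{\pi_k})^m (T_{\pi_k})^\infty v_{k-1} \\
&= (\gamma P_{\pi_k})^m \big(v_{k-1} - (T_{\pi_k})^\infty v_{k-1}\big) = (\gamma P_{\pi_k})^m \sum_{j=0}^{\infty} \big[(T_{\pi_k})^j v_{k-1} - (T_{\pi_k})^{j+1} v_{k-1}\big] \\
&= (\gamma P_{\pi_k})^m \Big(\sum_{j=0}^{\infty} (\gamma P_{\pi_k})^j\Big)(v_{k-1} - T_{\pi_k} v_{k-1}) \\
&= (\gamma P_{\pi_k})^m (I - \gamma P_{\pi_k})^{-1} (v_{k-1} - T_{\pi_k} v_{k-1}) = (\gamma P_{\pi_k})^m (I - \gamma P_{\pi_k})^{-1} b_{k-1}.
\end{aligned} \tag{25}$$

(a) For any $v$, we have $v_{\pi_k} = (T_{\pi_k})^\infty v$. This step follows by setting $v = v_{k-1}$, i.e., $v_{\pi_k} = (T_{\pi_k})^\infty v_{k-1}$.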
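The closed form in Eq. (25) is an exact identity for any fixed policy, so it can be verified directly. The sketch below is illustrative only: the transition matrix `P_pi`, the reward vector `r_pi` and the starting vector `v_prev` are randomly generated stand-ins, and the residual `v_prev - T(v_prev)` plays the role of $b_{k-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, m = 5, 0.8, 3

# Hypothetical policy model: row-stochastic P_pi and reward vector r_pi.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n_states)
v_prev = rng.normal(size=n_states)           # stands in for v_{k-1}


def T(v):
    """Bellman operator of the fixed policy."""
    return r_pi + gamma * P_pi @ v


I = np.eye(n_states)
v_pi = np.linalg.solve(I - gamma * P_pi, r_pi)   # fixed point of T, i.e. v_pi

w = v_prev.copy()
for _ in range(m):
    w = T(w)                                  # (T_pi)^m v_{k-1}
s_direct = w - v_pi                           # definition of s_k

residual = v_prev - T(v_prev)                 # plays the role of b_{k-1}
s_formula = np.linalg.matrix_power(gamma * P_pi, m) @ np.linalg.solve(
    I - gamma * P_pi, residual
)

print(np.max(np.abs(s_direct - s_formula)))
```

The two vectors agree up to rounding because $v - T_\pi v = (I - \gamma P_\pi)(v - v_\pi)$ for any $v$.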
B. Proof of Lemma 2

We begin by focusing our analysis on AMPI. Here we are interested in bounding the loss $l_k = v_* - v_{\pi_k} = d_k + s_k$. By induction, from Eqs. (20) and (21), we obtain

$$b_k \le \sum_{i=1}^{k} \Gamma^{m(k-i)} x_i + \Gamma^{mk} b_0, \tag{26}$$

$$d_k \le \sum_{j=0}^{k-1} \Gamma^{k-1-j}\Big(y_j + \sum_{l=1}^{m-1} \Gamma^l b_j\Big) + \Gamma^k d_0, \tag{27}$$

in which we have used the notation introduced in Definition 1. In Eq. (27), we also used the fact that from Eq. (24), we may write $g_{k+1} = \sum_{j=1}^{m-1} \Gamma^j b_k$. Moreover, we may rewrite Eq. (25) as

$$s_k = \Gamma^m \sum_{j=0}^{\infty} \Gamma^j b_{k-1} = \sum_{j=0}^{\infty} \Gamma^{m+j} b_{k-1}. \tag{28}$$

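Bounding $l_k$

From Eqs. (26) and (27), we may write

$$\begin{aligned}
d_k &\le \sum_{j=0}^{k-1} \Gamma^{k-1-j}\Big(y_j + \sum_{l=1}^{m-1} \Gamma^l \Big(\sum_{i=1}^{j} \Gamma^{m(j-i)} x_i + \Gamma^{mj} b_0\Big)\Big) + \Gamma^k d_0 \\
&= \sum_{i=1}^{k} \Gamma^{i-1} y_{k-i} + \sum_{j=0}^{k-1} \sum_{l=1}^{m-1} \sum_{i=1}^{j} \Gamma^{k-1-j+l+m(j-i)} x_i + z_k,
\end{aligned} \tag{29}$$

where we used the following definition

$$z_k = \sum_{j=0}^{k-1} \sum_{l=1}^{m-1} \Gamma^{k-1+l+j(m-1)} b_0 + \Gamma^k d_0 = \sum_{i=k}^{mk-1} \Gamma^i b_0 + \Gamma^k d_0.$$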

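The triple sum involved in Eq. (29) may be written as

$$\begin{aligned}
\sum_{j=0}^{k-1} \sum_{l=1}^{m-1} \sum_{i=1}^{j} \Gamma^{k-1-j+l+m(j-i)} x_i &= \sum_{i=1}^{k-1} \sum_{j=i}^{k-1} \sum_{l=1}^{m-1} \Gamma^{k-1+l+j(m-1)-mi} x_i = \sum_{i=1}^{k-1} \sum_{j=mi+k-i}^{mk-1} \Gamma^{j-mi} x_i \\
&= \sum_{i=1}^{k-1} \sum_{j=k-i}^{m(k-i)-1} \Gamma^{j} x_i = \sum_{i=1}^{k-1} \sum_{j=i}^{mi-1} \Gamma^{j} x_{k-i}.
\end{aligned} \tag{30}$$

Using Eq. (30), we may write Eq. (29) as

$$d_k \le \sum_{i=1}^{k} \Gamma^{i-1} y_{k-i} + \sum_{i=1}^{k-1} \sum_{j=i}^{mi-1} \Gamma^{j} x_{k-i} + z_k. \tag{31}$$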

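Similarly, from Eqs. (28) and (26), we have

$$\begin{aligned}
s_k &\le \sum_{j=0}^{\infty} \Gamma^{m+j}\Big(\sum_{i=1}^{k-1} \Gamma^{m(k-1-i)} x_i + \Gamma^{m(k-1)} b_0\Big) = \sum_{j=0}^{\infty}\Big(\sum_{i=1}^{k-1} \Gamma^{m+j+m(k-1-i)} x_i + \Gamma^{m+j+m(k-1)} b_0\Big) \\
&= \sum_{i=1}^{k-1} \sum_{j=0}^{\infty} \Gamma^{j+m(k-i)} x_i + \sum_{j=0}^{\infty} \Gamma^{j+mk} b_0 = \sum_{i=1}^{k-1} \sum_{j=0}^{\infty} \Gamma^{j+mi} x_{k-i} + \sum_{j=mk}^{\infty} \Gamma^{j} b_0 = \sum_{i=1}^{k-1} \sum_{j=mi}^{\infty} \Gamma^{j} x_{k-i} + z'_k,
\end{aligned} \tag{32}$$

where we used the following definition

$$z'_k = \sum_{j=mk}^{\infty} \Gamma^{j} b_0.$$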

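Finally, using the bounds in Eqs. (31) and (32), we obtain the following bound on the loss

$$\begin{aligned}
l_k = d_k + s_k &\le \sum_{i=1}^{k} \Gamma^{i-1} y_{k-i} + \sum_{i=1}^{k-1}\Big(\sum_{j=i}^{mi-1} \Gamma^{j} + \sum_{j=mi}^{\infty} \Gamma^{j}\Big) x_{k-i} + z_k + z'_k \\
&= \sum_{i=1}^{k} \Gamma^{i-1} y_{k-i} + \sum_{i=1}^{k-1} \sum_{j=i}^{\infty} \Gamma^{j} x_{k-i} + \eta_k,
\end{aligned} \tag{33}$$

where we used the following definition

$$\eta_k = z_k + z'_k = \sum_{j=k}^{\infty} \Gamma^{j} b_0 + \Gamma^k d_0. \tag{34}$$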

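Note that we have the following relation between $b_0$ and $d_0$:

$$b_0 = v_0 - T_{\pi_1} v_0 = v_0 - v_* + T_{\pi_*} v_* - T_{\pi_*} v_0 + T_{\pi_*} v_0 - T_{\pi_1} v_0 \le (I - \gamma P_{\pi_*})(-d_0) + \epsilon'_1. \tag{35}$$

In Eq. (35), we used the fact that $v_* = T_{\pi_*} v_*$, $\epsilon_0 = 0$, and $T_{\pi_*} v_0 - T_{\pi_1} v_0 \le \epsilon'_1$ (this is because the policy $\pi_1$ is $\epsilon'_1$-greedy w.r.t. $v_0$). As a result, we may write $|\eta_k|$ either as

$$|\eta_k| \le \sum_{j=k}^{\infty} \Gamma^{j}\big[(I - \gamma P_{\pi_*})|d_0| + |\epsilon'_1|\big] + \Gamma^k |d_0| \le \sum_{j=k}^{\infty} \Gamma^{j}\big[(I + \Gamma^1)|d_0| + |\epsilon'_1|\big] + \Gamma^k |d_0| = 2\sum_{j=k}^{\infty} \Gamma^{j} |d_0| + \sum_{j=k}^{\infty} \Gamma^{j} |\epsilon'_1|, \tag{36}$$

or, using the fact that from Eq. (35) we have $d_0 \le (I - \gamma P_{\pi_*})^{-1}(-b_0 + \epsilon'_1)$, as

$$|\eta_k| \le \sum_{j=k}^{\infty} \Gamma^{j} |b_0| + \Gamma^k \sum_{j=0}^{\infty} (\gamma P_{\pi_*})^j \big(|b_0| + |\epsilon'_1|\big) = \sum_{j=k}^{\infty} \Gamma^{j} |b_0| + \Gamma^k \sum_{j=0}^{\infty} \Gamma^{j} \big(|b_0| + |\epsilon'_1|\big) = 2\sum_{j=k}^{\infty} \Gamma^{j} |b_0| + \sum_{j=k}^{\infty} \Gamma^{j} |\epsilon'_1|. \tag{37}$$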

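Now, using the definitions of $x_k$ and $y_k$ in Eqs. (19) and (23), the bound on $|\eta_k|$ in Eq. (36) or (37), and the fact that $\epsilon_0 = 0$, we obtain

$$\begin{aligned}
l_k &\le \sum_{i=1}^{k} \Gamma^{i-1}\big[\Gamma |\epsilon_{k-i}| + |\epsilon'_{k-i+1}|\big] + \sum_{i=1}^{k-1} \sum_{j=i}^{\infty} \Gamma^{j}\big[(I + \Gamma)|\epsilon_{k-i}| + |\epsilon'_{k-i+1}|\big] + |\eta_k| \\
&\le \sum_{i=1}^{k-1}\Big(\Gamma^{i} + \sum_{j=i}^{\infty}(\Gamma^{j} + \Gamma^{j+1})\Big)|\epsilon_{k-i}| + \Gamma^k|\epsilon_0| + \sum_{i=1}^{k-1}\Big(\Gamma^{i-1} + \sum_{j=i}^{\infty}\Gamma^{j}\Big)|\epsilon'_{k-i+1}| + \Gamma^{k-1}|\epsilon'_1| + \sum_{j=k}^{\infty}\Gamma^{j}|\epsilon'_1| + h(k) \\
&= 2\sum_{i=1}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j}|\epsilon_{k-i}| + \sum_{i=1}^{k-1}\sum_{j=i-1}^{\infty}\Gamma^{j}|\epsilon'_{k-i+1}| + \sum_{j=k-1}^{\infty}\Gamma^{j}|\epsilon'_1| + h(k) = 2\sum_{i=1}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j}|\epsilon_{k-i}| + \sum_{i=0}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j}|\epsilon'_{k-i}| + h(k),
\end{aligned} \tag{38}$$

where we used the following definition

$$h(k) = 2\sum_{j=k}^{\infty}\Gamma^{j}|d_0|, \quad \text{or} \quad h(k) = 2\sum_{j=k}^{\infty}\Gamma^{j}|b_0|.$$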

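We end this proof by adapting the error propagation to CBMPI. As expressed by Eqs. 12 and 13 in Sec. 4, an analysis of CBMPI can be deduced from the one we have just done by replacing $v_k$ with the auxiliary variable $w_k = (T_{\pi_k})^m v_{k-1}$ and $\epsilon_k$ with $(\gamma P_{\pi_k})^m \epsilon_{k-1} = \Gamma^m \epsilon_{k-1}$. Therefore, using the fact that $\epsilon_0 = 0$, we can rewrite the bound of Eq. 38 for CBMPI as follows:

$$\begin{aligned}
l_k &\le 2\sum_{i=1}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j+m}|\epsilon_{k-i-1}| + \sum_{i=0}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j}|\epsilon'_{k-i}| + h(k) \\
&= 2\sum_{i=1}^{k-2}\sum_{j=m+i}^{\infty}\Gamma^{j}|\epsilon_{k-i-1}| + \sum_{i=0}^{k-1}\sum_{j=i}^{\infty}\Gamma^{j}|\epsilon'_{k-i}| + h(k).
\end{aligned} \tag{39}$$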
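C. Proof of Lemma 3

For any integer $t$ and vector $z$, the definition of $\Gamma^t$ and Hölder's inequality imply that

$$\rho \Gamma^t |z| = \big\|\Gamma^t |z|\big\|_{1,\rho} \le \gamma^t c_q(t) \|z\|_{q',\mu} = \gamma^t c_q(t) \big(\mu |z|^{q'}\big)^{1/q'}. \tag{40}$$

We define

$$K = \sum_{l=1}^{n} \xi_l \Big(\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j\Big),$$

where $\{\xi_l\}_{l=1}^{n}$ is a set of non-negative numbers that we will specify later. We now have

$$\begin{aligned}
\|f\|_{p,\rho}^p = \rho |f|^p &\le K^p \rho \Big(\frac{\sum_{l=1}^{n} \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \Gamma^j |g_i|}{K}\Big)^p = K^p \rho \Big(\frac{\sum_{l=1}^{n} \xi_l \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \Gamma^j \big(\frac{|g_i|}{\xi_l}\big)}{K}\Big)^p \\
&\overset{(a)}{\le} K^p \rho \, \frac{\sum_{l=1}^{n} \xi_l \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \Gamma^j \big(\frac{|g_i|}{\xi_l}\big)^p}{K} = K^p \, \frac{\sum_{l=1}^{n} \xi_l \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \rho \Gamma^j \big(\frac{|g_i|}{\xi_l}\big)^p}{K} \\
&\overset{(b)}{\le} K^p \, \frac{\sum_{l=1}^{n} \xi_l \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j c_q(j) \Big(\mu \big(\frac{|g_i|}{\xi_l}\big)^{pq'}\Big)^{1/q'}}{K} = K^p \, \frac{\sum_{l=1}^{n} \xi_l \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j c_q(j) \Big(\frac{\|g_i\|_{pq',\mu}}{\xi_l}\Big)^{p}}{K} \\
&\le K^p \, \frac{\sum_{l=1}^{n} \xi_l \Big(\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j c_q(j)\Big) \Big(\frac{\sup_{i \in \mathcal{I}_l} \|g_i\|_{pq',\mu}}{\xi_l}\Big)^{p}}{K} \\
&\overset{(c)}{=} K^p \, \frac{\sum_{l=1}^{n} \xi_l \Big(\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j\Big) C_q(l) \Big(\frac{\sup_{i \in \mathcal{I}_l} \|g_i\|_{pq',\mu}}{\xi_l}\Big)^{p}}{K},
\end{aligned}$$

where (a) results from Jensen's inequality, (b) from Eq. 40, and (c) from the definition of $C_q(l)$. Now, by setting $\xi_l = (C_q(l))^{1/p} \sup_{i \in \mathcal{I}_l} \|g_i\|_{pq',\mu}$, we obtain

$$\|f\|_{p,\rho}^p \le K^p \, \frac{\sum_{l=1}^{n} \xi_l \Big(\sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \gamma^j\Big)}{K} = K^p,$$

where the last step follows from the definition of $K$.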
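D. Proof of Theorem 1 & other Bounds on the Loss

Proof. We only detail the proof for AMPI (the proof being similar for CBMPI). We define $\mathcal{I} = \{1, 2, \dots, 2k\}$, the partition $\{\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3\}$ of $\mathcal{I}$ as $\mathcal{I}_1 = \{1, \dots, k-1\}$, $\mathcal{I}_2 = \{k, \dots, 2k-1\}$, and $\mathcal{I}_3 = \{2k\}$, and for each $i \in \mathcal{I}$

$$g_i = \begin{cases} 2\epsilon_{k-i} & \text{if } 1 \le i \le k-1, \\ \epsilon'_{k-(i-k)} & \text{if } k \le i \le 2k-1, \\ 2d_0 \ (\text{or } 2b_0) & \text{if } i = 2k, \end{cases} \qquad \text{and} \qquad \mathcal{J}_i = \begin{cases} \{i, i+1, \dots\} & \text{if } 1 \le i \le k-1, \\ \{i-k, i-k+1, \dots\} & \text{if } k \le i \le 2k-1, \\ \{k, k+1, \dots\} & \text{if } i = 2k. \end{cases}$$

Note that here we have divided the terms in the point-wise bound of Lemma 2 into three groups: the evaluation error terms $\{\epsilon_j\}_{j=1}^{k-1}$, the greedy step error terms $\{\epsilon'_j\}_{j=1}^{k}$, and finally the residual term $h(k)$. With the above definitions and the fact that the loss $l_k$ is non-negative, Lemma 2 may be rewritten as

$$|l_k| \le \sum_{l=1}^{3} \sum_{i \in \mathcal{I}_l} \sum_{j \in \mathcal{J}_i} \Gamma^j |g_i|.$$

The result follows by applying Lemma 3 and noticing that $\sum_{i=i_0}^{k-1} \sum_{j=i}^{\infty} \gamma^j = \frac{\gamma^{i_0} - \gamma^k}{(1-\gamma)^2}$. □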
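For completeness, the double geometric sum used in the last step can be worked out in two lines. This is only a spelled-out version of the identity quoted above, with $0 < \gamma < 1$ and $i_0 \le k-1$:

```latex
\sum_{j=i}^{\infty} \gamma^{j} = \frac{\gamma^{i}}{1-\gamma},
\qquad
\sum_{i=i_0}^{k-1} \sum_{j=i}^{\infty} \gamma^{j}
  = \frac{1}{1-\gamma} \sum_{i=i_0}^{k-1} \gamma^{i}
  = \frac{1}{1-\gamma} \cdot \frac{\gamma^{i_0} - \gamma^{k}}{1-\gamma}
  = \frac{\gamma^{i_0} - \gamma^{k}}{(1-\gamma)^{2}}.
```

For instance, with $\gamma = 0.9$, $i_0 = 1$ and $k = 10$, the sum equals $(0.9 - 0.9^{10})/0.01 \approx 55.13$.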

Here, in order to show the flexibility of Lemma 3, we group the terms differently and derive an alternative $L_p$-bound for the loss of AMPI and CBMPI. In analogy with the results of Farahmand et al. (2010), this new bound shows that the last iterations have the highest influence on the loss (the influence exponentially decreases towards the initial iterations).

Theorem 3. With the notations of Theorem 1, after k iterations, the loss of AMPI satisfies

$$\|l_k\|_{p,\rho} \le 2\sum_{i=1}^{k-1} \frac{\gamma^{i}}{1-\gamma} \big(C_q^{i,i+1}\big)^{1/p} \|\epsilon_{k-i}\|_{pq',\mu} + \sum_{i=0}^{k-1} \frac{\gamma^{i}}{1-\gamma} \big(C_q^{i,i+1}\big)^{1/p} \|\epsilon'_{k-i}\|_{pq',\mu} + g(k),$$

while the loss of CBMPI satisfies

$$\|l_k\|_{p,\rho} \le 2\gamma^m \sum_{i=1}^{k-2} \frac{\gamma^{i}}{1-\gamma} \big(C_q^{i,i+1,m}\big)^{1/p} \|\epsilon_{k-i-1}\|_{pq',\mu} + \sum_{i=0}^{k-1} \frac{\gamma^{i}}{1-\gamma} \big(C_q^{i,i+1}\big)^{1/p} \|\epsilon'_{k-i}\|_{pq',\mu} + g(k).$$

Proof. Again, we only detail the proof for AMPI (the proof being similar for CBMPI). We define $\mathcal{I}$, $(g_i)$ and $(\mathcal{J}_i)$ as in the proof of Theorem 1. We then make as many groups as terms, i.e., for each $n \in \{1, 2, \dots, 2k-1\}$, we define $\mathcal{I}_n = \{n\}$. The result follows by application of Lemma 3. □
E. Proof of Lemma 4



The proof of this lemma is similar to the proof of Theorem 1 in Lazaric et al. (2010). Before stating the proof, 
we report the following two lemmas that are used in the proof. 

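Lemma 6. Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi) < \infty$ and $N$ be the number of states in the rollout set $\mathcal{D}_{k-1}$ drawn i.i.d. from the state distribution $\mu$. Then we have

$$\mathbb{P}_{\mathcal{D}_{k-1}}\Big[\sup_{\pi \in \Pi} \big|\mathcal{L}^{\Pi}_{k-1}(\hat\mu; \pi) - \mathcal{L}^{\Pi}_{k-1}(\mu; \pi)\big| > \epsilon\Big] \le \delta,$$

with $\epsilon = 16 Q_{\max} \sqrt{\frac{2}{N}\big(h \log \frac{eN}{h} + \log \frac{8}{\delta}\big)}$.

Proof. This is a restatement of Lemma 1 in Lazaric et al. (2010). □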

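Lemma 7. Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi) < \infty$ and $s^{(1)}, \dots, s^{(N)}$ be an arbitrary sequence of states. At each state we simulate $M$ independent rollouts. Then we have

$$\mathbb{P}\Big[\sup_{\pi \in \Pi} \Big|\frac{1}{N}\sum_{i=1}^{N} \frac{1}{M}\sum_{j=1}^{M} R^{j}_{k-1}\big(s^{(i)}, \pi(s^{(i)})\big) - \frac{1}{N}\sum_{i=1}^{N} Q_{k-1}\big(s^{(i)}, \pi(s^{(i)})\big)\Big| > \epsilon\Big] \le \delta,$$

with $\epsilon = 8 Q_{\max} \sqrt{\frac{2}{MN}\big(h \log \frac{eMN}{h} + \log \frac{8}{\delta}\big)}$.

Proof. The proof is similar to the one for Lemma 6. □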

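Proof. (Lemma 4) Let $a^*(\cdot) = \arg\max_{a \in \mathcal{A}} Q_{k-1}(\cdot, a)$ be the greedy action. To simplify the notation, we remove the dependency of $a^*$ on states and use $a^*$ instead of $a^*(s^{(i)})$ in the following. We prove the following series of inequalities:

$$\begin{aligned}
\mathcal{L}^{\Pi}_{k-1}(\mu; \pi_k) &\overset{(a)}{\le} \mathcal{L}^{\Pi}_{k-1}(\hat\mu; \pi_k) + \epsilon'_1 && \text{w.p. } 1 - \delta' \\
&= \frac{1}{N}\sum_{i=1}^{N} \Big[Q_{k-1}(s^{(i)}, a^*) - Q_{k-1}\big(s^{(i)}, \pi_k(s^{(i)})\big)\Big] + \epsilon'_1 \\
&\overset{(b)}{\le} \frac{1}{N}\sum_{i=1}^{N} \Big[Q_{k-1}(s^{(i)}, a^*) - \hat Q_{k-1}\big(s^{(i)}, \pi_k(s^{(i)})\big)\Big] + \epsilon'_1 + \epsilon'_2 && \text{w.p. } 1 - 2\delta' \\
&\overset{(c)}{\le} \frac{1}{N}\sum_{i=1}^{N} \Big[Q_{k-1}(s^{(i)}, a^*) - \hat Q_{k-1}\big(s^{(i)}, \pi^*(s^{(i)})\big)\Big] + \epsilon'_1 + \epsilon'_2 \\
&\le \frac{1}{N}\sum_{i=1}^{N} \Big[Q_{k-1}(s^{(i)}, a^*) - Q_{k-1}\big(s^{(i)}, \pi^*(s^{(i)})\big)\Big] + \epsilon'_1 + 2\epsilon'_2 && \text{w.p. } 1 - 3\delta' \\
&= \mathcal{L}^{\Pi}_{k-1}(\hat\mu; \pi^*) + \epsilon'_1 + 2\epsilon'_2 \le \mathcal{L}^{\Pi}_{k-1}(\mu; \pi^*) + 2(\epsilon'_1 + \epsilon'_2) && \text{w.p. } 1 - 4\delta' \\
&= \inf_{\pi \in \Pi} \mathcal{L}^{\Pi}_{k-1}(\mu; \pi) + 2(\epsilon'_1 + \epsilon'_2).
\end{aligned}$$

The statement of the lemma is obtained by setting $\delta' = \delta/4$.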



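(a) This follows from Lemma 6.

(b) Here we introduce the estimated action-value function $\hat Q_{k-1}$ by bounding

$$\sup_{\pi \in \Pi} \Big|\frac{1}{N}\sum_{i=1}^{N} \hat Q_{k-1}\big(s^{(i)}, \pi(s^{(i)})\big) - \frac{1}{N}\sum_{i=1}^{N} Q_{k-1}\big(s^{(i)}, \pi(s^{(i)})\big)\Big|$$

using Lemma 7.

(c) From the definition of $\pi_k$ in CBMPI, we have

$$\pi_k = \arg\min_{\pi \in \Pi} \hat{\mathcal{L}}^{\Pi}_{k-1}(\hat\mu; \pi) = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \hat Q_{k-1}\big(s^{(i)}, \pi(s^{(i)})\big),$$

thus $-\frac{1}{N}\sum_{i=1}^{N} \hat Q_{k-1}\big(s^{(i)}, \pi_k(s^{(i)})\big)$ can only increase when $\pi_k$ is replaced with any other policy, particularly with

$$\pi^* = \arg\min_{\pi \in \Pi} \int_{\mathcal{S}} \Big(\max_{a \in \mathcal{A}} Q_{k-1}(s, a) - Q_{k-1}\big(s, \pi(s)\big)\Big) \mu(ds).$$

□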
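Step (c) relies on the classifier returning the policy that minimizes the empirical error, which is the same as maximizing the average estimated action value over the rollout states. The sketch below illustrates this on a toy problem; the finite policy space, the dictionary `q_hat` of rollout estimates and all function names are hypothetical and chosen only for illustration.

```python
def empirical_error(policy, states, q_hat, actions):
    """Mean gap between the best estimated action value and the policy's choice."""
    total = 0.0
    for s in states:
        best = max(q_hat[(s, a)] for a in actions)
        total += best - q_hat[(s, policy(s))]
    return total / len(states)


def fit_policy(policy_space, states, q_hat, actions):
    """Return the policy in the (finite) policy space with the smallest empirical error."""
    return min(policy_space, key=lambda pi: empirical_error(pi, states, q_hat, actions))


# Toy rollout set with three states and two actions (illustrative numbers only).
states = [0, 1, 2]
actions = [0, 1]
q_hat = {
    (0, 0): 1.0, (0, 1): 0.2,
    (1, 0): 0.4, (1, 1): 0.9,
    (2, 0): 0.1, (2, 1): 0.8,
}
policy_space = [
    lambda s: 0,                     # always action 0
    lambda s: 1,                     # always action 1
    lambda s: 0 if s < 1 else 1,     # threshold policy
]

pi_k = fit_policy(policy_space, states, q_hat, actions)
print([pi_k(s) for s in states], empirical_error(pi_k, states, q_hat, actions))
```

Because the term $\max_a \hat Q_{k-1}(s, a)$ does not depend on the policy, minimizing this error and maximizing the mean of $\hat Q_{k-1}(s, \pi(s))$ select the same policy.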
F. Proof of Lemma 5

Let us define two n-dimensional vectors $z = \big(\big[(T_{\pi_k})^m v_{k-1}\big](s^{(1)}), \dots, \big[(T_{\pi_k})^m v_{k-1}\big](s^{(n)})\big)^\top$ and $y = \big(\hat v_k(s^{(1)}), \dots, \hat v_k(s^{(n)})\big)^\top$ and their orthogonal projections onto the vector space $\mathcal{F}_n$ as $\hat z = \hat\Pi z$ and $\hat y = \hat\Pi y = \big(\tilde v_k(s^{(1)}), \dots, \tilde v_k(s^{(n)})\big)^\top$, where $\tilde v_k$ is the result of linear regression and its truncation (by $V_{\max}$) is $v_k$, i.e., $v_k = \mathbb{T}(\tilde v_k)$ (see Figure 2). What we are interested in is a bound on the regression error $\|z - \hat y\|$ (the difference between the target function $z$ and the result of the regression $\hat y$). We may decompose this error as

$$\|z - \hat y\|_n \le \|\hat z - \hat y\|_n + \|z - \hat z\|_n = \|\hat\xi\|_n + \|z - \hat z\|_n, \tag{41}$$

where $\hat\xi = \hat z - \hat y$ is the projected noise (estimation error) $\hat\xi = \hat\Pi \xi$, with the noise vector $\xi = z - y$ defined as $\xi_i = \big[(T_{\pi_k})^m v_{k-1}\big](s^{(i)}) - \hat v_k(s^{(i)})$. It is easy to see that the noise is zero mean, i.e., $\mathbb{E}[\xi_i] = 0$, and is bounded by $2V_{\max}$, i.e., $|\xi_i| \le 2V_{\max}$. We may write the estimation error as

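$$\|\hat z - \hat y\|_n^2 = \|\hat\xi\|_n^2 = \langle \hat\xi, \hat\xi \rangle = \langle \xi, \hat\xi \rangle,$$

where the last equality follows from the fact that $\hat\xi$ is the orthogonal projection of $\xi$. Since $\hat\xi \in \mathcal{F}_n$, let $f_\alpha \in \mathcal{F}$ be any function whose values at $\{s^{(i)}\}_{i=1}^{n}$ equal $\{\hat\xi_i\}_{i=1}^{n}$. By application of a variation of Pollard's inequality (Györfi et al., 2002), we obtain

$$\langle \xi, \hat\xi \rangle = \frac{1}{n}\sum_{i=1}^{n} \xi_i f_\alpha(s^{(i)}) \le 4 V_{\max} \|\hat\xi\|_n \sqrt{\frac{2}{n} \log\Big(\frac{3(9e^2 n)^{d+1}}{\delta'}\Big)},$$

with probability at least $1 - \delta'$. Thus, we have

$$\|\hat z - \hat y\|_n = \|\hat\xi\|_n \le 4 V_{\max} \sqrt{\frac{2}{n} \log\Big(\frac{3(9e^2 n)^{d+1}}{\delta'}\Big)}. \tag{42}$$

From Eqs. 41 and 42, we have

$$\big\|(T_{\pi_k})^m v_{k-1} - \tilde v_k\big\|_{\hat\mu} \le \big\|(T_{\pi_k})^m v_{k-1} - \hat\Pi (T_{\pi_k})^m v_{k-1}\big\|_{\hat\mu} + 4 V_{\max} \sqrt{\frac{2}{n} \log\Big(\frac{3(9e^2 n)^{d+1}}{\delta'}\Big)}, \tag{43}$$

where $\hat\mu$ is the empirical norm induced from the n i.i.d. samples from $\mu$.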

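Now, in order to obtain a random design bound, we first define $f_{\hat\alpha_*} \in \mathcal{F}$ as $f_{\hat\alpha_*}(s^{(i)}) = \big[\hat\Pi (T_{\pi_k})^m v_{k-1}\big](s^{(i)})$, and then define $f_{\alpha_*} = \Pi (T_{\pi_k})^m v_{k-1}$, which is the best approximation (w.r.t. $\mu$) of the target function $(T_{\pi_k})^m v_{k-1}$ in $\mathcal{F}$. Since $f_{\hat\alpha_*}$ is the minimizer of the empirical loss, any function in $\mathcal{F}$ different from $f_{\hat\alpha_*}$ has a bigger empirical loss, thus we have

$$\big\|f_{\hat\alpha_*} - (T_{\pi_k})^m v_{k-1}\big\|_{\hat\mu} \le \big\|f_{\alpha_*} - (T_{\pi_k})^m v_{k-1}\big\|_{\hat\mu} \le 2\big\|f_{\alpha_*} - (T_{\pi_k})^m v_{k-1}\big\|_{\mu} + 12\Big(V_{\max} + \|\alpha_*\|_2 \sup_x \|\phi(x)\|_2\Big)\sqrt{\frac{2}{n}\log\frac{3}{\delta'}}, \tag{44}$$

with probability at least $1 - \delta'$, where the second inequality is the application of a variation of Theorem 11.2 in the book by Györfi et al. (2002) with $\|f_{\alpha_*} - (T_{\pi_k})^m v_{k-1}\|_\infty \le V_{\max} + \|\alpha_*\|_2 \sup_x \|\phi(x)\|_2$. Similarly, we can write the left-hand side of Equation 43 as

$$2\big\|(T_{\pi_k})^m v_{k-1} - \tilde v_k\big\|_{\hat\mu} \ge 2\big\|(T_{\pi_k})^m v_{k-1} - \mathbb{T}(\tilde v_k)\big\|_{\hat\mu} \ge \big\|(T_{\pi_k})^m v_{k-1} - \mathbb{T}(\tilde v_k)\big\|_{\mu} - 24 V_{\max}\sqrt{\frac{2}{n}\Lambda(n, d, \delta')}, \tag{45}$$

with probability at least $1 - \delta'$, where $\Lambda(n, d, \delta') = 2(d+1)\log n + \log\frac{e}{\delta'} + \log\big(9(12e)^{2(d+1)}\big)$. Putting together Equations 43, 44, and 45 and using the fact that $\mathbb{T}(\tilde v_k) = v_k$, we obtain

$$\big\|(T_{\pi_k})^m v_{k-1} - v_k\big\|_{\mu} \le 2\Big(2\big\|(T_{\pi_k})^m v_{k-1} - f_{\alpha_*}\big\|_{\mu} + 12\Big(V_{\max} + \|\alpha_*\|_2 \sup_x \|\phi(x)\|_2\Big)\sqrt{\frac{2}{n}\log\frac{3}{\delta'}} + 4 V_{\max}\sqrt{\frac{2}{n}\log\Big(\frac{3(9e^2 n)^{d+1}}{\delta'}\Big)}\Big) + 24 V_{\max}\sqrt{\frac{2}{n}\Lambda(n, d, \delta')}.$$

The result follows by setting $\delta = 3\delta'$ and some simplification.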
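The regression-then-truncation step analysed in this proof can be illustrated with a short numerical sketch. It is not the authors' implementation: the feature matrix, the bounded target `z`, the noisy targets `y` and the function name `fit_truncated_value` are all assumptions made for the example. It checks two facts used above: the least-squares residual is orthogonal to the feature space, and truncating at $V_{\max}$ cannot increase the empirical error with respect to a target bounded by $V_{\max}$.

```python
import numpy as np


def fit_truncated_value(features, targets, v_max):
    """Least-squares fit followed by truncation of the fitted values to [-v_max, v_max]."""
    alpha, *_ = np.linalg.lstsq(features, targets, rcond=None)
    v_tilde = features @ alpha                  # regression output at the sample states
    v = np.clip(v_tilde, -v_max, v_max)         # truncated values
    return alpha, v_tilde, v


def empirical_norm(u):
    """Empirical L2 norm over the n sample states."""
    return float(np.sqrt(np.mean(u ** 2)))


rng = np.random.default_rng(0)
n, d, v_max = 200, 4, 1.0
features = rng.normal(size=(n, d))                              # hypothetical features phi(s_i)
z = np.clip(features @ rng.normal(size=d), -v_max, v_max)       # bounded target values
y = z + rng.uniform(-0.5, 0.5, size=n)                          # noisy, zero-mean estimates

alpha, v_tilde, v = fit_truncated_value(features, y, v_max)
err_untruncated = empirical_norm(z - v_tilde)
err_truncated = empirical_norm(z - v)
print(err_untruncated, err_truncated)
```

The truncated error is never larger because clipping moves each fitted value towards the interval that already contains the target.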
Figure 3. Performance of the learned policies in mountain car with two different 2×2 RBF grids: the grid that approximates the value function well is on the left, and the grid that approximates it poorly is on the right. The total budget B is set to 200. The objective is to minimize the number of steps to the goal.



G. Experimental Results 

In this section, we report the empirical evaluation of CBMPI and compare it to DPI and LSPI. The experiments show that CBMPI, by combining policy and value function approximation, can improve over both DPI and LSPI. We use the same setting as Gabillon et al. (2011) to facilitate the comparison.

G.1. Setting

We consider the mountain car (MC) problem with its standard formulation in which the action noise is bounded in [−1, 1] and γ = 0.99. The value function is approximated using a linear space spanned by a set of radial basis functions (RBFs) evenly distributed over the state space.

Each CBMPI-based algorithm is run with the same fixed budget B per iteration. CBMPI splits the budget into a rollout budget $B_R = B(1 - p)$ used to build the training set of the greedy step and a critic budget $B_C = Bp$ used to build the training set of the evaluation step, where $p \in (0, 1)$ is the critic ratio. The rollout budget is divided into M rollouts of length m for each action in $\mathcal{A}$ and each state in the rollout set $\mathcal{D}$, i.e., $B_R = mMN|\mathcal{A}|$. The critic budget is divided into one rollout of length m for each action in $\mathcal{A}$ and each state in the rollout set $\mathcal{D}$, i.e., $B_C = mn|\mathcal{A}|$.
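The budget split described above determines the sizes N and n of the two rollout sets. The helper below is a minimal sketch of that arithmetic; the function name and the rounding-down convention are assumptions, and the value of three actions used in the example is the usual mountain car action count rather than something stated in this section.

```python
def rollout_set_sizes(budget, critic_ratio, m, num_rollouts, num_actions):
    """Split a per-iteration budget B into rollout-set sizes (N, n).

    Uses B_R = B(1 - p) = m * M * N * |A| and B_C = B * p = m * n * |A|,
    rounding N and n down to integers (an assumption of this sketch).
    """
    rollout_budget = budget * (1.0 - critic_ratio)   # B_R
    critic_budget = budget * critic_ratio            # B_C
    n_greedy = int(rollout_budget // (m * num_rollouts * num_actions))   # N
    n_critic = int(critic_budget // (m * num_actions))                   # n
    return n_greedy, n_critic


# DPI-like setting: no critic (p = 0), rollouts of length 12, M = 1, three actions.
print(rollout_set_sizes(budget=200, critic_ratio=0.0, m=12, num_rollouts=1, num_actions=3))

# Doubling the rollout length roughly halves both rollout sets.
print(rollout_set_sizes(budget=200, critic_ratio=0.5, m=1, num_rollouts=1, num_actions=3))
print(rollout_set_sizes(budget=200, critic_ratio=0.5, m=2, num_rollouts=1, num_actions=3))
```

Under these assumptions, the first call gives N = 5 for B = 200 and m = 12, and the last two calls show how both set sizes shrink as m grows for a fixed budget.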

In Fig. 3, we report the performance of DPI, CBMPI, and LSPI. In MC, the performance is evaluated as the number of steps-to-go with a maximum of 300. The results are averaged over 1000 runs. We report the performance of DPI and LSPI at p = 0 and p = 1, respectively. DPI can be seen as a special case of CBMPI where p = 0. We tested DPI and CBMPI on a wide range of parameters (m, M, N, n), but we only report their performance for the best choice of M (M = 1 was the best choice in all the experiments) and different values of m.

G.2. Experiments 

As discussed in Remark 5, the parameter m balances the error in evaluating the value function against the error in evaluating the policy. The value function approximation error tends to zero for large values of m. Although this would suggest using large values of m, the sizes of the rollout sets would correspondingly decrease as N = O(B/m) and n = O(B/m), thus decreasing the accuracy of both the regression and classification problems. This leads to a trade-off between long rollouts and the number of states in the rollout sets. The solution to this trade-off strictly depends on the capacity of the value function space $\mathcal{F}$. With a rich value function space, the trade-off is resolved at small values of m. On the other hand, when the value function space is poor, or as in the DPI case, m should be selected so as to guarantee a sufficient number of informative rollouts and, at the same time, large enough rollout sets.

Figure 3 shows the learning results in MC with budget B = 200. On the left panel, the function space is rich enough to approximate $v_*$. Therefore LSPI has almost optimal results (about 80 steps to reach the goal). On the other hand, DPI achieves a poor performance of about 150 steps, which is obtained by setting m = 12 and N = 5. We also report the performance of CBMPI for different values of m and p. When p is large enough, the value function approximation becomes accurate enough that the best solution is to have m = 1. This corresponds both to rollouts built almost entirely on the basis of the approximated value function and to a large number of states N in the training set. For m = 1 and p ≈ 0.8, CBMPI achieves a slightly better performance than LSPI.

In the next experiment, we show that CBMPI is able to outperform both DPI and LSPI when $\mathcal{F}$ has a lower accuracy. The results are reported on the right panel of Figure 3. The performance of LSPI now worsens to 190 steps. At the same time, m = 1 is no longer the best choice for CBMPI. Indeed, when m = 1, CBMPI becomes an approximate version of the value iteration algorithm relying on a function space not rich enough to approximate $v_*$. Notice that relying on this space is still better than setting the value function to zero, which is the case in DPI. Therefore, we notice an improvement of CBMPI over DPI for m = 4, which trades off between the estimates of the value function and the rewards collected by the rollouts. By combining the two, CBMPI also improves upon LSPI.



